The purpose of this research is twofold: one is to develop
quantitative measures from the speech and the electroglottography
(EGG) signals for the assessment of laryngeal function; the other is
to extract features from the voiced speech signal for speaker
identification.

Linear predictive coding (LPC) provides a good parametric
representation of the speech waveform and can be expected to
reflect changes in the speech signal due to laryngeal
dysfunction. The EGG signal is known to represent the vocal folds'
vibratory pattern with good fidelity.

In this work, we carried these ideas further and developed two
methods for the detection of a laryngeal pathology: 1) a spectral
distortion measure for the pitch asynchronous LPC vectors using the
vector quantization (VQ) technique and 2) a perturbation analysis
of the EGG signal with a set of time interval and amplitude
difference measurements.

In a closed threshold test for 29 pathological and 52 normal
subjects, these two methods gave 75.9% and 69.0% probabilities of
detection, respectively, for the pathological subjects with a 9.6%
probability of false alarm for the normal subjects. In the
discriminant analysis with a "leave-one-out" method, both methods
resulted in a 69.0% probability of detection with a 7.7% probability
of false alarm. In the analogous tests for the pitch period and
amplitude perturbation of the EGG signal, the probability of
detection was approximately 10% lower than that of the two methods
we developed.

We proposed a speaker identification scheme using a speaker-based
VQ codebook of the sustained vowel. With the pitch synchronous LPC
vector of the sustained vowel as a feature vector, a VQ codebook
size of 4 was found to be suitable to characterize each speaker's
feature space. For 40 normal speakers (20 males, 20 females), we
achieved a correct identification rate of 99.4% on the training data
set and 89.4% on the test data set with speech samples of 50 pitch
periods. The duration of the test speech samples (the number of test
vectors) was shown to have little effect on the identification
performance.


CHAPTER ONE


INTRODUCTION 

Sound  Sources  and  Speech  Production 
The speech waveform is an acoustic pressure wave that
originates from voluntary physiological movements of the human vocal
structures shown in Figure 1-1. The larynx, supported by ligaments
and controlled by muscles activated by numerous nerves, is composed
of soft tissues encased in a cartilaginous skeleton. A mucous
membrane lines the larynx as it does the trachea. We will not
describe the anatomy and physiology of the human speech production
system in detail here; the reader is referred to Moore [1971] and
Zemlin [1968]. We are concerned only with a basic understanding of
the human speech production mechanism.

One of the biological functions of the larynx is to provide a
protective closure for the respiratory system. Another important
function is performed by the vocal folds, namely, to modulate the
stream of air passing between them during respiratory activity. The
lungs are typically considered as air reservoirs capable of
expelling air up the trachea to the vocal folds. For voiced sounds
such as vowels, the pressure increases until the folds are pushed
apart, forming a slit that is known as the glottis. A puff of air
passing through this glottal opening then sets the vocal folds into
vibratory motion.


Figure 1-1. Sketch of the human vocal structure
[Childers, 1977, p. 378]

According to the myoelastic-aerodynamic theory of phonation [Berg et
al., 1957; Berg, 1958], this vibratory motion of the vocal folds is
directly attributed to the air flow through the glottis and is the
result of the interplay of two forces. One is the subglottal
pressure, which pushes the adducted folds apart, releasing a puff of
air through the glottis. The other is the Bernoulli effect: when the
velocity of the air passing through the glottis is relatively large,
it results in a drop in pressure across the folds, producing a
suction effect that pulls the folds back together and closes the
glottis. The succession of air pulses generated by this vibratory
motion of the vocal folds sets up an acoustic wave that travels
through the vocal and nasal tract cavities.

The sound generated has an auditory correlate, pitch, that is
measured acoustically as the fundamental frequency of the voice. The
frequency of oscillation of the vocal folds is determined by their
mass, length, thickness, elasticity, and compliance as well as by
the subglottal pressure. For example, the greater the tension, the
higher the perceived pitch. The subglottal air pressure and the time
variations in glottal area determine the volume velocity of the
glottal air flow, or glottal wave, expelled into the vocal tract. It
is the glottal wave that defines the acoustic energy input, or
driving function, to the vocal tract.

The vocal tract is a nonuniform acoustic tube which extends
from the glottis to the lips and varies in shape as a function of
time. The major anatomical components causing this time-varying
change are the lips, jaw, tongue, and velum. The primary element
controlling the vocal tract is the tongue, which divides the vocal
tract into two resonant cavities, the pharynx and the mouth, which,
in turn, determine the transmission characteristics of the vocal
tract. These characteristics can also be modified by coupling the
nasal cavity to the oral cavity under the control of the soft palate,
or velum. During nonnasal sound generation, the velum closes off the
nasal cavity from the vocal tract. The nasal cavity constitutes an
additional acoustic tube for sound transmission used in the
generation of the nasal sounds /n/, /m/, and /ŋ/, as in run, rum,
and rung, respectively.

The vocal tract transmission characteristic is such that it
causes certain frequency components of the glottal wave to pass with
less attenuation than others. An example of this frequency transfer
characteristic is shown in Figure 1-2. The resonances, or peaks, in
this spectrum are referred to as formants, and their center
frequencies are designated in ascending order of appearance as F1,
F2, etc. Each formant also has a bandwidth. The first two or three
formants are generally sufficient for the perceptual
characterization of most voiced vowels and consonants in English.
The higher formants are thought to be important for natural-sounding
or good quality speech.
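
As a rough illustration of how formants shape the transfer characteristic, the following Python sketch models the vocal tract as a cascade of second-order digital resonators, one per formant, and locates the peaks of the resulting magnitude response. The all-pole cascade, the function names, the 10 kHz sampling rate, and the formant frequencies and bandwidths are all assumptions chosen for illustration; they are not measurements from this study.

```python
import numpy as np

def vocal_tract_response_db(formants_hz, bandwidths_hz, fs, n_points=2048):
    """Magnitude response (dB) of an all-pole cascade of formant resonators.

    Each formant F with bandwidth B contributes a complex-conjugate pole
    pair at radius r = exp(-pi*B/fs) and angle theta = 2*pi*F/fs.
    """
    freqs = np.linspace(0.0, fs / 2.0, n_points)
    z_inv = np.exp(-1j * 2.0 * np.pi * freqs / fs)  # z^-1 on the unit circle
    denom = np.ones(n_points, dtype=complex)
    for f, b in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * b / fs)
        theta = 2.0 * np.pi * f / fs
        denom *= 1.0 - 2.0 * r * np.cos(theta) * z_inv + (r ** 2) * z_inv ** 2
    return freqs, -20.0 * np.log10(np.abs(denom))

def spectral_peaks(freqs, level_db):
    """Frequencies of the interior local maxima of the response."""
    inner = level_db[1:-1]
    is_peak = (inner > level_db[:-2]) & (inner > level_db[2:])
    return freqs[1:-1][is_peak]

# Hypothetical formant values for a back vowel (illustrative only).
freqs, level_db = vocal_tract_response_db(
    formants_hz=[300.0, 870.0, 2240.0],
    bandwidths_hz=[60.0, 90.0, 120.0],
    fs=10000.0,
)
peaks = spectral_peaks(freqs, level_db)  # approximate F1, F2, F3
```

The peaks fall close to, but not exactly at, the assumed formant frequencies, because the skirts of neighboring resonances tilt each peak slightly.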

A similar  formant  structure  is  also  observed  for  unvoiced 
sounds.  The  shape  of  the  vocal  tract  filters  the  acoustic  pressure 
wave  that  is  radiated  finally  as  a pressure  waveform  at  the  lips. 


Figure 1-2. Frequency transfer characteristic of the vocal tract
(amplitude in dB, with the first formant labeled F1).
The phonation was /u/ produced by a normal male subject.

For unvoiced sounds, however, the vocal folds are held apart so that
air expelled from the lungs up the trachea is unaffected by glottal
vibrations. There are two fundamental types of unvoiced sounds:
fricatives and plosives. A typical fricative is the sound /s/, which
is produced by expelling air through a constriction, such as the
teeth, to produce turbulent air flow. An example of a plosive is /p/
as in puff, which is created by the rapid release of air pressure
built up behind a closure such as the lips.

Objective 

A human listener is able to extract from the acoustic signal not
only linguistic information but also information concerning the
speaker's identity and personality, and can often discern the
speaker's emotional state, age, dialect, and health. Although
progress is being made on isolated word recognition [Jelinek, 1985;
Hiraoka et al., 1986; Martin et al., 1988], continuous speech
recognition [Bahl et al., 1983; Chen and Zue, 1986; Murveit and
Weintraub, 1988], and speaker recognition [Doddington, 1985;
O'Shaughnessy, 1986; Soong et al., 1987; Soong and Rosenberg, 1988]
systems, automatic speech and speaker recognition procedures are far
from accomplishing such feats. This is because so little is known
about the acoustic features, vocal differences, prosodic variations
(such as phoneme durations, pitch contours, speaking rate, and
stress), and other factors that influence our ability to determine
the following:
• what was said
• speaker's identity
• sex of the speaker
• dialect or accent of the speech
• emotional state of the speaker
• age of the speaker
• voice disorders

From this standpoint, the purpose of this research is to analyze a
two-channel (speech and EGG) signal of voiced sounds, such as
vowels, in order to extract features that capture information about
the following voice characteristics:

• voice disorders (detection of a laryngeal pathology)
• who the speaker was (speaker identification)

We investigated the speech and EGG signals to extract useful
features for the evaluation of laryngeal function, i.e., for the
detection of voice disorders. The adequacy of these extracted
features was examined in various ways, such as histograms,
cross-correlations, and classification tests. We also carried out a
speaker identification experiment with the pitch synchronous LPC
(linear predictive coding) vectors obtained from sustained vowel
phonation. The effect of different speaker identification parameters
on the performance was examined.

Description  of  Chapters 

Chapter 2 is comprised of three sections. The first section reviews
methods for the evaluation of laryngeal function, with considerable
attention given to the description of the vocal folds' vibratory
pattern with the EGG signal and to acoustic speech signal analysis.
In the second section, we briefly review existing methods for
speaker identification. The chapter concludes with the research
rationale for this dissertation.

Chapter 3 contains a discussion of techniques used for speech
analysis, i.e., linear prediction and vector quantization (VQ). We
used a modified form of the Itakura-Saito distortion measure with
LPC vectors for the spectral distortion analysis and speaker-based
VQ codebook design.

Chapter 4 is comprised of three sections. The data acquisition
system and the data base we collected are described in the first
section. The selection of the LPC order and some examples of the LPC
spectra for the normal and pathological subjects follow in the
second section. The last section covers the selection of the
codebook size, with examples of the VQ analysis for the normal and
pathological subjects.

Experimental procedures, results, and a discussion of the
assessment of laryngeal function are presented in Chapter 5. From
the speech signal, we analyzed the LPC spectral distortion using the
VQ technique. A perturbation analysis of various segments and events
of the EGG signal with our algorithm is discussed in the following
section. Chapter 5 concludes with an evaluation of our analysis
methods for the misclassified subjects and for subjects who were
undergoing vocal treatment.

Chapter 6 contains the speaker identification schemes that we
proposed, with their experimental results and discussion. The pitch
synchronous LPC vectors were used as feature vectors for speaker
identification. In the second section, our speaker identification
scheme is described. The experimental results and discussion are
given in the last section of this chapter.

Finally, Chapter 7 summarizes the conclusions of this study and
gives some suggestions for further work.


CHAPTER  TWO 


REVIEW  OF  CONVENTIONAL  TECHNIQUES 

Detection  Methods  of  Laryngeal  Pathologies 
Voice  disorders  are  defined  by  Moore  as  "deviations  in  pitch, 
loudness  and  quality  that  are  judged  to  be  atypical  of  the  voice 
characteristics  of  most  persons  having  the  same  age,  sex,  and 
cultural background" [Moore, 1971, p. 2]. Injury or disease may
cause  voice  disorders  that  are  frequently  perceptible  in  the  speech 
of  the  affected  individual.  On  the  other  hand,  a voice  disorder  may 
not  cause  an  acoustically  perceptible  change  in  the  individual's 
speech,  even  to  an  experienced  speech  pathologist  or  clinician. 

Voice disorders can be classified into two categories:
functional and organic. Functional disorders are associated with
incorrect use or misuse of an otherwise healthy and nondefective
vocal organ. The incorrect use of the vocal folds is sometimes
related to psychological problems. In some cases, however, misuse is
intentional, to produce a unique quality of voice, as in the case of
actors or singers. Some of the diseases associated with a functional
disorder are paralysis and myasthenia laryngis, meaning a laryngeal
muscle without strength. Organic disorders result from organic
changes to the structure of the vocal folds. The organic lesions
include tumors, polyps, vocal nodules, papilloma, cancer, etc. It is
interesting to note that functional misuse of the voice, such as
excessive shouting at a football game, may result in an organic
disorder due to the development of a nodule.

Other factors that affect the motion of the vocal folds are
laryngeal webs, contact ulcers, irritating substances, surgical
intervention, and non-localized diseases such as allergies or
hormonal influences [Moore, 1971]. Laryngitis, probably the best
known of the laryngeal diseases, has many causes, which include
smoking and alcoholism as well as vocal abuse. The symptoms are
breathiness and hoarseness. Hoarseness is perhaps the most
frequently recognized symptom of a laryngeal pathology or disorder.

Many methods exist for detecting and evaluating voice
disorders. The vast majority, however, have been used only in
research. They are complicated and usually are not readily
available in voice clinics. Screening procedures for the detection
of laryngeal disorders, particularly early detection with a simple
signal analysis technique, are still strongly needed. We now turn
our attention to the methods used to either detect or evaluate voice
disorders. Because we use a two-channel signal (speech and EGG) for
the analysis, our review gives more attention to the acoustic
analysis and EGG analysis methods.

Aerodynamic  Tests 

The following four parameters are usually used in the
aerodynamic test: subglottal pressure, supraglottal pressure,
glottal impedance, and volume velocity of the air flow at the
glottis, or glottal volume velocity. While many measurements of
subglottal pressure and supraglottal pressure have been made [Hikl
et al., 1970; Koike, 1981; Cranen and Boves, 1985], few attempts
have been made to measure the glottal volume velocity directly.
Instead of direct measurement, considerable research has been
conducted in an attempt to estimate this parameter from the speech
signal. We will discuss this later in the section dealing with the
acoustic analysis method.

The average subglottal pressure has been measured directly with
a pressure transducer introduced into the subglottal region through
a tracheal puncture; it can also be measured by lowering a
transducer through the glottis into the trachea [Koike and Hirano,
1973; Cranen and Boves, 1985]. Researchers reported an increase in
subglottal pressure for patients with carcinoma and recurrent
laryngeal nerve paralysis [Hiroto, 1966; Kuroki, 1969]. The
subglottal pressure along with the supraglottal pressure and the
glottal resistance can be used to calculate the mean flow rate. The
mean flow rate is usually measured using devices such as
pneumotachographs and spirometers. It is generally greater for
pathological subjects than for normal subjects [Hirano et al., 1968;
Iwata et al., 1972], and, in general, it can be used to monitor
progress in treatment.
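
The relation between these quantities can be sketched with a pressure-flow analogy of Ohm's law, in which the glottal resistance is the ratio of the transglottal pressure drop to the mean flow rate. The function name, the units, and the numerical values in this Python sketch are illustrative assumptions, not data from the cited studies.

```python
def mean_flow_rate(p_sub, p_supra, glottal_resistance):
    """Mean glottal flow rate from the transglottal pressure drop.

    Assumes the Ohm's-law analogy U = (P_sub - P_supra) / R_g, with
    pressures in cmH2O and resistance in cmH2O/(L/s), so U is in L/s.
    """
    if glottal_resistance <= 0:
        raise ValueError("glottal resistance must be positive")
    return (p_sub - p_supra) / glottal_resistance

# Hypothetical values: 8 cmH2O below the glottis, 0 above, R_g = 40.
flow = mean_flow_rate(p_sub=8.0, p_supra=0.0, glottal_resistance=40.0)
```

Under this analogy, a lower glottal resistance at the same pressure drop gives a proportionally higher mean flow rate.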

Many of these procedures involve some risk and discomfort to
the subject. It is unlikely that these methods will be used
extensively in the near future to detect laryngeal pathologies,
despite the fact that abnormal air flow rate has been used in the
diagnosis of laryngeal pathologies [Childers, 1977].

Examination of the Vocal Folds' Vibration

Normal vibration of the vocal folds is a prerequisite for a
normal-sounding voice. Moore [1976] and Moore et al. [1985] have
observed that the factor which determines the phonatory quality of
the resulting sound is not the disease itself but the vocal folds'
vibratory pattern. This is why the perceived acoustics of the sound
do not map uniquely onto the underlying pathology: different
pathologies can result in similar vibratory patterns of the folds.

Laryngeal function is primarily observed indirectly with the
aid of a laryngoscope or mirror. Indirect laryngoscopy is simple to
use, inexpensive, and requires no elaborate equipment or data
processing. However, the procedure provides no permanent record
useful for medical history or therapy. Furthermore, even when the
subject can tolerate the instrument, the subject is able to phonate
only a very limited set of sounds. In addition, since the vocal
folds vibrate so rapidly, the viewer is not able to follow their
detailed motion by use of laryngoscopy alone. To overcome the
disadvantages of indirect laryngoscopy, many methods have been
developed for observing the vibration of the vocal folds. These
techniques include stroboscopy, ultra-high-speed cinematography,
photoelectric glottography, electroglottography, and ultrasound
glottography.

Stroboscopic examination of the vibrating folds allows the
examiner to freeze the image of the vocal folds at the same position
in the vibratory cycle when the strobe light is synchronized with
the folds' vibratory cycle [Fog-Pedersen, 1977]. These photographs
allow evaluation of the size, position, shape, orientation, and
color of the laryngeal structure, but they do not show fine details
of the vibratory cycle. The primary limitation of this method is
that the investigator sees only brief "snapshots" of consecutive
cycles of the vibratory pattern and never sees a complete vibratory
cycle. Despite the fact that it shares the general limitations of
indirect laryngoscopy, stroboscopic examination has been and is
still used in a clinical setting. This technique, however, will
probably not be used in the near future since laryngeal research has
reached a higher level of sophistication.

While regular-speed cameras operate at 24 frames per second,
ultra-high-speed photography is capable of filming at 4,000 to
10,000 or more frames per second. The film produced allows a
detailed study of the vibration of the vocal folds during a glottal
cycle. With the exception of the work using fiberoptic bundles,
ultra-high-speed photography uses an indirect laryngoscope,
typically employing a laryngeal mirror. Either black and white or
color film can be used. The details of the procedure appear in
numerous papers [Farnsworth, 1940; Moore et al., 1962]. The
advantages of this technique are that the vibratory pattern of the
vocal folds can be viewed in slow motion and that a detailed
frame-by-frame analysis can be made. However, there are also
numerous problems and disadvantages to this procedure [Childers,
1977]. This technique can only be implemented in a research
laboratory since it is too complex and expensive for clinical use.
The techniques to be discussed below are also indirect methods
for monitoring laryngeal function. These procedures, however, have
been and are still used because they are simple and overcome many
disadvantages of the observational methods discussed above. These
procedures also have their limitations, and the data obtained are
generally related only to glottal area or closure.

Photoglottography (PGG), also called photoelectric or optical
glottography, attempts to determine the glottal area function by
measuring the amount of light that passes through the glottis
[Sonesson, 1959; Fant and Sonesson, 1962]. The light source may
either be above the vocal folds, introduced through the mouth or
nasal passages, or introduced through the neck below the larynx. As
the vocal folds vibrate, the light is modulated and picked up by a
photosensor whose electrical output may be displayed on an
oscilloscope. In modern applications of this method, the light is
usually supplied by a flexible fiber glass cable, often in
combination with a fiberoptic laryngoscope, which can be used to
monitor the placement of the light source and the gross appearance
of the larynx.

Photoglottography has been used to study how the glottal area
function varies with different kinds of voicing [Kitzing and
Sonesson, 1974; Kitzing, 1982]. Besides the physiological and
clinical applications, PGG has also been successfully used by
phoneticians to monitor laryngeal articulation. However, the
results of this technique have sometimes agreed with, and sometimes
differed from, data measured from ultra-high-speed films on a
frame-by-frame basis [Wendahl and Coleman, 1967; Harden, 1975].
Nevertheless, it is a promising method because it is relatively
inexpensive and can be used in a clinical setting. If the artifacts
can be isolated and the data substantiated through the simultaneous
application of other procedures, such as ultra-high-speed
photography or electroglottography, then this technique may be
effectively used in both research and clinical environments.

Ultrasonic systems have been used in a manner analogous to X-ray
systems to obtain images of the skeletal structure within the
body. The ultrasonic wave can travel through the body tissues, but
at each interface between organs of different acoustic impedance
some reflection occurs, which can be detected and measured. As the
impedance of air is considerably smaller than that of the tissues,
the reflection is almost total at the level of the vocal folds
protruding into the vocal tract. Several methods of ultrasonic
glottography (UGG) have been developed [Hamlet, 1981] to make use of
these almost ideal conditions to measure laryngeal vibrations, but
the obstacles to a clinically practicable UGG are manifold, and none
of the methods presented so far seems to have gained general
acceptance.

Electroglottography (EGG) is based on the electrical
transmission of a high-frequency current through the tissues at the
glottal level. A weak (microampere) alternating current from a
signal generator, which may be of either the constant voltage or the
constant current type, is applied to electrodes placed in direct
contact with the skin of the neck on each side of the larynx. The
frequency of the current is typically on the order of several MHz,
and the voltage level is about 0.5 volt, depending on the tissue
impedance and the current. The vibration of the vocal folds
constitutes a varying impedance path that modulates a small part of
the radio frequency current transmitted between the two electrodes.
These modulations can be detected and amplified to obtain the EGG
signal. A functional block diagram of an electroglottograph with a
typical EGG signal is shown in Figure 2-1. The change in impedance
across the larynx is primarily due to the change in the lateral
contact area of the vocal folds [Fourcin, 1981; Childers et al.,
1984; Childers and Krishnamurthy, 1985]. Hence, most investigators
believe the EGG is a measure of the vocal folds' contact area and
not of the area of the glottis. However, it has been impossible so
far to confirm this by independent measurements, and the factors
causing the depicted changes in laryngeal impedance are not known in
detail.

In order for the EGG signal to be useful to clinicians or
speech researchers, the EGG waveform must be related to the vocal
folds' vibratory cycle. This relation can best be understood by
comparing the EGG signal with the glottal area function. Many
researchers have carried out this work using the EGG signal with
synchronized stroboscopy [Fog-Pedersen, 1977], photoglottography
[Kitzing, 1982], and ultra-high-speed cinematography [Childers et
al., 1982; Baer et al., 1983] along with the glottal (volume
velocity) waveform. Childers et al. [1982] found that the point of
maximum negative value in a differentiated EGG signal agrees well
with the closing time of the vocal folds. However, as shown in
Figure 2-1, the open phase of the EGG signal normally lacks detail,
as the impedance is at its maximum whether the glottal area is
narrow or wide. Therefore, it should be noted that it is impossible
to find the point of maximum glottal opening with the aid of the EGG
signal.

Figure 2-1. Functional block diagram of the electroglottograph
and the EGG signal

Although the EGG signal does not seem appropriate for detailed
monitoring of the glottal vibratory cycle, its simple configuration
with one steep deflection in every period makes it ideally suited to
measurement of the pitch period, which is the reciprocal of the
fundamental frequency of the voice. Many researchers have used the
EGG signal for the reliable measurement of speech fundamental
frequency. Fourcin [1981] used the EGG signal for pathology
discrimination. His studies indicated differences in the frequency
range and distribution between normal subjects and subjects with
voice pathology. Haji et al. [1986] computed both pitch period and
amplitude perturbations for the same purpose. They found some
correlation between the amount of perturbation of the EGG signal and
the degree of hoarseness evaluated from auditory perception and from
the spectrogram.
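
The measurements described above can be sketched in Python as follows. The sketch marks one instant per cycle at the most negative slope of the EGG signal, takes the pitch periods as the intervals between those instants, and summarizes cycle-to-cycle perturbation as the mean absolute difference between consecutive values divided by their mean. This perturbation definition is one common choice assumed here for illustration and is not necessarily the measure used by Haji et al. [1986]; the function names, the threshold, the signal polarity, and the synthetic data are likewise assumptions.

```python
import numpy as np

def closure_instants(egg, threshold_ratio=0.5):
    """Sample indices of the steepest negative slopes of the EGG signal.

    Assumes a polarity in which vocal fold closure appears as the most
    negative value of the differentiated EGG in each cycle.
    """
    d = np.diff(np.asarray(egg, dtype=float))
    threshold = threshold_ratio * d.min()
    return np.array([
        i for i in range(1, len(d) - 1)
        if d[i] < threshold and d[i] <= d[i - 1] and d[i] < d[i + 1]
    ])

def mean_relative_perturbation(values):
    """Mean absolute cycle-to-cycle difference divided by the mean value."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(np.abs(np.diff(values))) / np.mean(values))

# Synthetic EGG-like ramp with one steep drop every 100 samples.
fs = 10000.0
egg = (np.arange(1000) % 100) / 100.0
instants = closure_instants(egg)
periods = np.diff(instants) / fs          # pitch periods in seconds
f0 = 1.0 / np.mean(periods)               # mean fundamental frequency in Hz
jitter = mean_relative_perturbation(periods)

# Perturbation of a hypothetical sequence of per-cycle peak amplitudes.
shimmer = mean_relative_perturbation([1.00, 1.02, 0.98, 1.00])
```

For a perfectly periodic signal such as this synthetic ramp the period perturbation is zero; real EGG recordings would need filtering and a more robust peak picker than this fixed threshold.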

The Mind-Machine Interaction Laboratory at the University of
Florida has conducted extensive studies on the EGG signal with
synchronized high-speed film data and speech signal [Childers et
al., 1982]. Smith [1980] used a discriminant analysis procedure on
the distribution of the roots obtained by the autocorrelation linear
prediction analysis of the EGG and speech signals. He showed that
the EGG signal analysis gives better results in discriminating
normal and pathological subjects than the speech signal analysis.
However, this conclusion is uncertain because we are not sure
whether the EGG signal can be represented well by an all-pole model
and because he used a very limited and typical data base. Alsaka
[1987] tried to model the EGG signal to permit simulation of the EGG
signal with more flexibility for vibrations of the vocal folds under
varied conditions. Using his model, he simulated the effect of vocal
nodules and mucus on the EGG signal. Alsaka also tried to classify
the laryngeal pathology based on the probability mass function
measured from the EGG signal. However, his model could only be
applied, with many restrictions, to the very typical case of the EGG
waveform.

Acoustic  Analysis 

Traditionally,  laryngologists,  phoneticians,  and  speech 
pathologists  have  relied  on  two  basic  techniques  for  evaluating 
voice  disorders:  listening  to  the  voice  and  viewing  the  larynx  with 
the  aid  of  a mirror  or  laryngoscope.  Diseases  of  the  larynx  are 
often  accompanied  by  changes  in  voice  quality,  and  experienced 
laryngologists  can,  to  some  extent,  detect  some  pathologies  using 
almost  purely  subjective  perceptual  judgement  [Gilmore,  1974; 
Stoicheff  et  al.,  1983].  However,  perceptual  ratings  are  often 
unreliable,  both  within  and  across  raters  [Shipp  and  Huntington, 
1965;  Klich,  1982].  Also  lacking  is  a consistent  vocabulary  for 
describing  laryngeal  disorder  and  intervention  [Perkins,  1971; 
Hirano, 1981].

An increasing amount of research has been directed towards
finding objective methods of analyzing the speech signal for the
evaluation of the vocal function or laryngeal pathology. This is
not only because the acoustic method has the advantages of being
totally noninvasive, requiring simple recording procedures and being
feasible on a small computer system, but also because the method has
promising clinical applications for the early detection and
differential diagnosis of a laryngeal pathology, as well as for the
quantitative assessment of the vocal function of patients undergoing
such treatments as surgery, voice therapy, etc.

Many attempts have been made to derive acoustic measures that
are useful for the detection and classification of laryngeal
pathologies from the voice. The detection of abnormalities in the
vocal fold performance is generally based on the spectral or
temporal analysis of a speech segment, either in the form of a
direct acoustic signal or a signal processed by an inverse filtering
technique. The measures reported so far are primarily associated
with the following acoustic characteristics of the speech signal:

. perturbations  - jitter  and  shimmer 
. noise  components  included  in  the  signal 
. long  time  average  spectral  characteristics 
. application  of  inverse  filtering 

A cycle-to-cycle  variation  in  pitch  period  that  occurs  when  an 
individual  is  attempting  to  sustain  a phonation  at  a constant 
frequency  is  called  jitter.  A similar  variation  in  pitch  amplitude 
is  known  as  shimmer.  Small  variations  in  pitch  period  and  amplitude 
from cycle to cycle in the speech waveform are known to be a natural
phenomenon in normal speech [Lieberman, 1961]. In fact, such
perturbations  are  important  for  the  natural  quality  of  speech 
synthesis  [Holmes,  1963].  However,  large  perturbations  reflect 
alterations  in  the  normal  vibratory  pattern  of  the  vocal  folds  and 
are  often  associated  with  a laryngeal  dysfunction.  Lieberman  [1961 
and  1963]  proposed  two  parameters  to  measure  pitch  period 
perturbations.  One  is  the  pitch  perturbation  (PP)  that  is  defined 
as  the  time  difference  between  the  duration  of  successive  pitch 
periods  in  the  speech  signal,  and  the  other  is  the  pitch 
perturbation  factor  (PPF),  that  is,  the  relative  frequency  of  a 
pitch  perturbation  larger  than  0.5  msec  occurring  in  a steady  vowel 
sound.  Koike  [1969  and  1973]  observed  that  normal  subjects 
phonating  a steady  vowel  sound  exhibit  a slow  and  relatively  smooth 
change  in  the  pitch  period  and  amplitude  sequences.  He  then 
proposed  a perturbation  quotient  as  a quantitative  measure  of  the 
variation  based  on  the  moving  average  of  a sequence.  There  are  also 
many  derivatives  of  these  perturbation  measures  [Hecker  and  Kreul, 
1971; Koike and Takahashi, 1971; Kasuya et al., 1983]. These
perturbation  analyses  could  be  done  with  a direct  speech  signal  or 
glottal  waveform  obtained  from  inverse  filtering  of  the  speech 
signal. 
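
The perturbation measures described above can be sketched in a few lines. This is only an illustrative sketch, assuming the pitch periods have already been extracted and are given in milliseconds. The function names are hypothetical, and the perturbation quotient shown here (mean absolute deviation from a centered moving average, divided by the mean of the sequence) is one plausible reading of a moving-average-based measure, not necessarily the exact form used by Koike.

```python
import numpy as np

def pitch_perturbations(periods_ms):
    """Differences between the durations of successive pitch periods (PP)."""
    periods_ms = np.asarray(periods_ms, dtype=float)
    return np.diff(periods_ms)

def pitch_perturbation_factor(periods_ms, threshold_ms=0.5):
    """Relative frequency of perturbations larger than threshold_ms (PPF)."""
    magnitudes = np.abs(pitch_perturbations(periods_ms))
    return float(np.mean(magnitudes > threshold_ms))

def perturbation_quotient(values, window=3):
    """Assumed form: mean |x_i - centered moving average| / mean of x."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window % 2 == 0 or len(values) < window:
        raise ValueError("window must be odd and no longer than the sequence")
    kernel = np.ones(window) / window
    smooth = np.convolve(values, kernel, mode="valid")
    half = window // 2
    centre = values[half:len(values) - half]  # samples aligned with `smooth`
    return float(np.mean(np.abs(centre - smooth)) / np.mean(values))
```

The same perturbation quotient function can be applied to a sequence of peak amplitudes to obtain a shimmer-type measure, since it makes no assumption about the units of its input.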

Pathological  voices  (voices  of  patients  with  a laryngeal 
pathology)  often  reveal  turbulent  noises  mainly  resulting  from  an 
incomplete  closure  of  the  glottis  caused  by  pathological  changes  of 
the  vocal  folds  [Isshiki  et  al.,  1966;  Fritzell  et  al.,  1983]. 
Also, it has long been known that the degree of perceived hoarseness
of pathological voices is closely correlated with the amount of
vocal  noise  relative  to  the  level  of  the  harmonic  components.  Since 
Yanagihara  [1967a  and  1967b]  reported  the  close  correlation  between 
the  degree  of  perceived  hoarseness  and  the  amount  of  noise  observed 
in  sound  spectrograms,  much  research  has  been  done  on  the  evaluation 
of  vocal  noise  included  in  a pathological  voice  from  acoustic  and 
perceptual  viewpoints. 

Sansone  and  Emanuel  [1970]  estimated  the  noise  level  in  the 
spectrum  of  sustained  vowels,  and  found  a relationship  between  the 
spectral  noise  level  (SNL)  and  the  perceived  degree  of  the  roughness 
quality  of  the  voice.  Sound  spectrographic  analysis  was  conducted 
by  Imaizumi  et  al.  [1980],  who  measured  the  average  level  of  the 
noise  components  relative  to  that  of  the  harmonic  components  and 
applied  it  to  the  evaluation  of  hoarseness. 

Kojima  et  al.  [1980]  reported  the  objective  estimation  of 
hoarseness  using  Fourier  analysis  in  order  to  separate  the  noise 
from  the  periodic  components.  The  resulting  signal-to-noise  (S/N) 
ratio  showed  a significant  correlation  with  psychophysical  ratings 
of  the  degree  of  hoarseness. 

Yumoto  et  al.  [1982]  tried  to  separate  harmonic  components  from 
vocal  noise  by  averaging  50  consecutive  periods,  assuming  the  strict 
long-term  stationarity  of  the  signal  and  additive  noise  with  a zero 
mean,  and  proposed  the  harmonic-to-noise  (H/N)  ratio  as  an  acoustic 
measure  of  the  degree  of  hoarseness.  Hiraoka  et  al.  [1984]  used 
relative  harmonic  intensity  to  evaluate  hoarse  voices,  but  it  is 
difficult to isolate harmonic components from noise components in
the high-frequency range when pathological voices sometimes show
larger  noise  levels  than  harmonic  levels. 
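
The period-averaging idea behind the H/N ratio can be illustrated as follows. This is a minimal sketch under stated assumptions: the consecutive pitch periods have already been segmented and brought to a common length (the array layout and the function name are hypothetical), the average period is taken as the harmonic estimate, and the deviation of each period from that average is taken as noise. It is not claimed to reproduce the exact procedure of Yumoto et al.

```python
import numpy as np

def harmonic_to_noise_ratio_db(periods):
    """H/N ratio in dB from an array of shape (n_periods, samples_per_period).

    Assumes a stationary signal with additive zero-mean noise, so that
    averaging the periods suppresses the noise and keeps the harmonic part.
    """
    periods = np.asarray(periods, dtype=float)
    n_periods = periods.shape[0]
    average = periods.mean(axis=0)                    # harmonic estimate
    harmonic_energy = n_periods * np.sum(average ** 2)
    noise_energy = np.sum((periods - average) ** 2)   # deviation from average
    return 10.0 * np.log10(harmonic_energy / noise_energy)
```

Because the noise energy appears in the denominator, a perfectly periodic input would give a division by zero; real recordings always contain some deviation from the average period.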

Instead  of  assuming  a long-term  stationarity  of  the  signal, 
Kasuya  et  al.  [1986]  tried  to  estimate  the  amount  of  vocal  noise 
from  a frame-based  signal  with  adaptive  filtering.  They  devised  an 
adaptive  comb  filtering  method  operating  in  the  frequency  domain  to 
estimate  noise  components  from  a sustained  vowel  phonation  and 
proposed  an  acoustic  measure  called  normalized  noise  energy  (NNE). 
They  showed  that  NNE  is  very  effective  in  discriminating  laryngeal 
diseases  that  are  closely  related  to  incomplete  glottal  closure  such 
as  T2-T4  glottic  cancer,  recurrent  nerve  paralysis,  and  the  vocal 
fold  nodules. 

Another  technique  is  the  long-time-average  spectra  (LTAS)  used 
by Frokjaer-Jensen and Prytz [1976]. This method utilizes a real
time, narrow band spectral analyzer to compute the spectra of the
speech signal averaged over 45 seconds. Several parameters of
the resulting spectra are then examined. These parameters include the
fundamental frequency (Fo), the first formant (F1), the frequency of
minimum spectrum level between Fo and F1 (Fmin), the level of
maximum peak in the entire spectrum and its frequency (Lmax, Fmax),
the quotient of Fo and F1, the harmonic richness defined as the
energy below 1 kHz divided by the energy above 1 kHz, etc. Among
them, harmonic richness, the spectral slope inclination in the first
formant region, and the ratio between peak levels of the fundamental
frequency and first formant region are considered the most important.
Using a statistical approach to the parameters obtained by LTAS,
Wendler et al. [1980] tried to classify voice qualities as
normal, breathy, hoarse, or rough. However, this approach still needs
improvement to be useful for clinical applications.
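
As an illustration of the LTAS idea, the sketch below averages the power spectra of consecutive frames and computes the harmonic richness as the energy below 1 kHz divided by the energy above 1 kHz. The frame length, the Hamming window, and the function names are assumptions made for this example; the original measurements were made with a hardware real time analyzer, not with this procedure.

```python
import numpy as np

def long_time_average_spectrum(signal, sample_rate, frame_length=1024):
    """Average the power spectra of consecutive non-overlapping frames."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_length
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")
    frames = signal[:n_frames * frame_length].reshape(n_frames, frame_length)
    window = np.hamming(frame_length)
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_length, d=1.0 / sample_rate)
    return freqs, power.mean(axis=0)

def harmonic_richness(freqs, ltas, split_hz=1000.0):
    """Energy below split_hz divided by the energy at or above split_hz."""
    low = ltas[freqs < split_hz].sum()
    high = ltas[freqs >= split_hz].sum()
    return float(low / high)
```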

Many  investigations  have  reported  on  the  direct  extraction  of 
glottal  waveform  from  the  speech  signal  for  the  assessment  of 
laryngeal  function  [Deller,  1982;  Milenkovic,  1986;  Hunt,  1987]. 
Sondhi  [1975]  and  Monsen  and  Engebretson  [1977]  used  a long 
reflectionless  metal  tube  that  acts  as  a pseudoinfinite  termination 
of  the  vocal  tract,  and  measured  the  glottal  waveform  directly  for 
the  neutral  vowel  phonation  by  placing  a microphone  in  the  tube. 
The reflectionless tube is difficult to construct, however, and the
method is limited to neutral vowels.

The inverse filtering technique is the most popular method used for
estimating the glottal waveform. The theoretical foundation of this
procedure is founded on the premise that the glottal waveform modified
by  the  resonance  and  damping  characteristics  of  the  vocal  tract  is 
finally  radiated  from  the  lips  as  speech.  Therefore,  if  a model  of 
the  vocal  tract  could  be  constructed,  it  would  be  possible  to 
process  the  speech  or  acoustic  signal  in  the  reverse  manner  to 
obtain  an  estimate  of  the  glottal  waveform.  Assuming  that  a 
laryngeal  disease  adversely  affects  the  vibratory  motion  of  the 
vocal  folds,  modifying  the  glottal  waveform,  this  method  could  be 
effectively used for pathology detection. Many signal processing
procedures have been reported for the accurate estimation of the
glottal waveform; however, the technique is still in a preliminary
stage with respect to its complete automation and accuracy,
especially in the analysis of pathological voices.

The linear prediction method of inverse filtering has also been
applied  to  the  speech  signal  to  obtain  the  error  signal  or  residue 
from  which  physically  significant  features  of  laryngeal  pathology 
may  be  obtained.  Koike  and  Markel  [1975]  have  analyzed  a number  of 
normal  subjects  and  patients  with  various  laryngeal  pathologies  by 
comparing  qualitatively  the  residue  signals  derived  by  inverse 
filtering.  According  to  their  results,  comparisons  between  the 
acoustic  speech  waveform  and  residue  signals  illustrate  the 
superiority  of  the  residue  signal  for  detecting  irregularity. 
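
The way a residue signal is obtained by linear prediction inverse filtering can be sketched as follows. This is a generic illustration using the autocorrelation method, with hypothetical function names. The predictor order, windowing, and preemphasis used in the studies cited above are not specified here and are left to the caller.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Predictor coefficients a[0..order-1] by the autocorrelation method.

    The predictor is s[n] ~ sum_k a[k] * s[n - k - 1].
    """
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation values r[0] .. r[order]
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lpc_residue(frame, a):
    """Error signal e[n] = s[n] - sum_k a[k] * s[n - k - 1]."""
    frame = np.asarray(frame, dtype=float)
    inverse_filter = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(frame, inverse_filter)[:len(frame)]
```

For voiced speech the residue obtained this way tends to show a pulse near each glottal excitation, which is why irregularities in the excitation can be easier to see in the residue than in the speech waveform itself.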

Davis  [1976]  extended  this  work  by  developing  several 
quantitative  parameters  relative  to  the  residue  signal  for  the 
classification  of  normal/pathological  subjects.  His  work  was  very 
extensive,  and  he  used  the  residue  signal  to  get  several  speech 
related parameters. Prosek et al. [1987] conducted some experiments
to assess the correlations of residue features with some perceptual
properties of voice disorder. They showed that the residue features
may be useful in assessing the degree of vocal impairment; however,
the features were not suitable as correlates of voice quality.

Deller  and  Anderson  [1980]  used  the  inverse  filter  to  get 
information  about  aberrant  laryngeal  behavior  by  examining  the 
placement  of  the  zeros  of  the  inverse  filter  in  the  z-plane  over 
consecutive  short  segments  of  synthetic  speech  signal.  Smith  [1980] 
did similar work with both speech and EGG signals. Muta et al.
[1987] analyzed the residue signal with a different approach. They
measured noise components in the residue signal by estimating the
harmonic component of the error signal, and then obtained a
noise-to-signal ratio for the quantitative evaluation of mildly or
moderately hoarse voices.

Psycho-Acoustic  Evaluation 

Listeners'  impressions  of  a voice  quality  are  widely  used  in 
clinical  practice  by  speech  pathologists  and  otolaryngologists  to 
assess  voice  disorders.  This  is  called  psycho-acoustic  evaluation. 
The  psycho-acoustic  parameters  are  the  voice  pitch,  loudness, 
laryngeal  quality,  and  resonance,  etc.  [Moore,  1971].  A voice  is 
judged  as  abnormal  when  any  of  these  parameters  deviate  from  the 
expected  range  of  persons  of  the  same  age,  sex  and  cultural 
background  [Askenfelt  and  Hammarberg,  1986;  Bassich  and  Ludlow, 
1986]. 

The  pitch  is  the  perceptual  correlate  of  the  frequency  of  the 
folds'  vibration,  also  referred  to  as  the  fundamental  frequency.  It 
is  judged  atypical  or  defective  if  it  is  too  high,  too  low, 
monotonous,  or  tremulous  [Boone,  1971].  Loudness  is  the  sensation 
related  to  the  amplitude  of  the  molecular  motion  in  the  sound  wave. 
It  is  judged  abnormal  when  a voice  is  too  loud  or  too  quiet  in 
relation  to  a specific  environmental  situation  or  when  the  loudness 
variation  is  inappropriate  to  the  meaning  of  the  utterance  [Boone, 
1971]. 

The  quality  of  the  voice  is  not  as  easily  defined. 
Professionals  use  many  terms  to  designate  perceptions  of  different 
voice  qualities.  Voice  quality  disorders  encompass  a wide  range  of 
voice disorders. A voice with a quality disorder can be breathy,
rough or harsh, hoarse, husky, throaty, metallic, hypernasal, etc.
The descriptive nature of these terms lends itself to different
interpretations  by  different  voice  pathologists.  Thus,  perceptual 
ratings  of  voice  quality  have  demonstrated  varying  degrees  of  inter- 
and  intrajudge  reliability  [Shipp  and  Huntington,  1965;  Klich,  1982; 
Askenfelt  and  Hammarberg,  1986;  Bassich  and  Ludlow,  1986]. 
Reliability  also  may  vary  with  the  type  and  length  of  voice  samples 
judged,  the  amount  of  listener  training,  the  type  of  listening  task, 
and  the  number  of  dimensions  rated. 

Therefore,  the  psycho-acoustic  evaluation  procedure  suffers 
from  nonstandardization  of  the  terms  used  to  describe  the 
pathological  voice.  The  classifications  of  a voice  as  hoarse, 
breathy,  harsh,  rough,  etc.  have  different  meanings  to  different 
voice  pathologists  because  the  descriptive  terms  used  for  diagnosis 
are  not  standardized.  Many  researchers  are  currently  working  to 
provide  a universally  acceptable  method  for  defining  these  terms  and 
a standard  procedure  to  measure  the  extent  of  these  disorders  using 
a generally  accepted  scale. 

Identifying  People  by  their  Voice 

When  a person  speaks,  he  produces  a complex  acoustic  signal, 
providing  various  forms  of  information.  This  signal  conveys 
primarily a linguistic message; listeners who are familiar with the
language can transcribe, or at least repeat, what the speaker said.
Besides  conveying  a message,  the  speech  signal  reflects  some  of  the 
anatomy and physiology of the speaker. For example, listeners can
often determine the speaker's sex, approximate age, emotional state,
and  whether  he/she  is  suffering  from  a voice  disorder.  Of 
particular  interest  is  the  ability  of  listeners  to  distinguish  the 
characteristics  of  different  speakers.  This  ability  provides  the 
basis  for  one  method  of  speaker  recognition. 

There  are  three  general  methods  of  speaker  recognition.  They 
are  speaker  recognition  by  listening,  by  visual  comparison  of 
spectrograms,  and  by  machine.  Among  them,  speaker  recognition  by 
machine  has  received  a great  deal  of  attention  from  speech 
researchers.  The  acoustic  aspects  that  differentiate  voices  are 
difficult  to  separate  from  signal  traits  which  reflect  the  identity 
of  the  sounds  (i.e.,  the  abstract  linguistic  units  called  phonemes). 

There  are  two  sources  of  speech  variation  among  speakers.  One 
is  anatomical,  i.e.,  the  differences  in  size  and  shape  of  the  vocal 
folds  and  the  vocal  tract  shape,  and  the  other  is  learned  habits, 
e.g.,  the  differences  in  speaking  style.  The  latter  includes 
variations  in  both  target  vocal  tract  positions  for  phonemes  and 
dynamic  aspects  of  speech  such  as  speaking  rate.  There  are  still  no 
acoustic  cues  specially  or  exclusively  dealing  with  speaker 
identity.  Most  of  the  parameters  and  features  used  in  speech 
analysis  contain  information  useful  for  the  recognition  of  both  the 
speaker  and  the  spoken  message. 

Speaker  Recognition 

Speaker  recognition,  depending  on  the  nature  of  the  final  task, 
can  be  classified  into  two  related  but  different  categories  of  voice 
recognition: speaker verification and speaker identification.
Both tasks use a stored data base of reference patterns, or
templates,  for  N known  speakers,  and  similar  analysis  and  decision 
making  techniques  are  employed.  Here  we  will  review  the  voice 
characteristics  and  identification  techniques  that  are  used  in 
recognizing  people  by  voice. 

Verification  vs.  identification.  In  a speaker  verification 
task,  the  recognizer  is  asked  to  verify  an  identity  claim  made  by  an 
unknown  speaker  and  a decision  to  reject  or  to  accept  the  identity 
claim is made. This is a relatively simple task because it only requires
comparing  the  test  pattern  against  one  reference  pattern  and  it 
involves  a binary  decision  whether  the  test  speech  matches  the 
template  of  the  claimed  speaker.  Speakers  known  to  the  system  who 
claim  their  true  identity  are  called  customers,  while  others  are 
imposters. In a speaker identification task, the recognizer is
asked to decide which of a set of N known speakers best matches
the unknown speaker. Furthermore, the test speaker's voice may
not be among the N stored patterns, in which case a "no match"
decision is required.

Depending  on  the  input  speech  materials  used  for  speaker 
recognition,  the  system  becomes  either  text  dependent  where 
utterances  of  the  same  text  are  used  for  training  and  testing,  or 
text  independent  where  training  and  testing  involve  utterances  from 
different  texts.  The  text  dependent  system  permits  the  simple 
comparison  of  word  templates  by  establishing  a nonlinear  time 
alignment  between  test  input  and  reference  utterances,  and  occurs 
frequently  in  speaker  verification  tasks,  but  rarely  for  an 
identification task [O'Shaughnessy, 1986].

The text independent system can also use template matching, but
very  different  information  may  be  stored  in  the  templates  than  in 
the  text  dependent  system  [Cheung  and  Eisenstein,  1978].  Since 
there  is  no  possibility  of  time  alignment  between  test  input  and 
reference utterances in the text independent system, speaker
characterization is carried out either by statistical averaging over
selected acoustic features, or by locating comparable speech events
in test and reference utterances. Therefore, the error rate for
text  independent  recognition  is  considerably  higher  than  for  the 
comparable  text  dependent  one.  To  achieve  good  results  for  the  text 
independent  speaker  recognition  system,  much  more  speech  data  are 
usually  needed  for  both  training  and  testing  than  for  the  text 
dependent  system. 

Intraspeaker  and  interspeaker  variability.  It  is  well  known 
that  the  pronunciation  of  a given  word  or  phrase  tends  to  vary  from 
speaker  to  speaker.  Acoustical  analysis  of  utterances  by  several 
speakers  typically  reveals  many  dissimilarities.  These 
dissimilarities  between  speakers  are  called  interspeaker  variability 
[Hecker,  1971].  Interspeaker  variability  in  the  speech  signal  can 
be  attributed  in  part  to  organic  differences  in  the  structure  of  the 
vocal  mechanism  and  in  part  to  learned  differences  in  the  use  of  the 
vocal  mechanism  during  speech  production.  Organic  differences  may 
be  determined  by  heredity,  sex,  and  age,  while  learned  differences 
may  be  related  to  geographical,  social,  and  cultural  factors. 

In  vowel  production,  the  detailed  shape  of  the  glottal  waveform 
is influenced by many anatomical and physiological factors,
including the dimensions of the vocal folds, control of the
laryngeal  muscles,  and  intensity  of  vocal  effort  [Hecker,  1971]. 
These  factors  tend  to  differ  among  speakers.  Even  for  a particular 
speaker,  several  factors  can  differ  considerably  from  utterance  to 
utterance.  Thus  the  glottal  waveform  can  contribute  to  the 
variability  between  speakers  or  within  a speaker.  The  vocal  tract 
transfer  function  also  reflects  individual  differences  in  the 
dimensions  of  the  vocal  tract  [Stevens  and  House,  1961]. 

The speech and EGG waveforms for a sustained vowel /i/ uttered
by three male and three female speakers are shown in Figure 2-2.
The dissimilarity of the EGG waveforms across speakers suggests
that the glottal waveform contributes to speaker
variability.

In  generating  an  utterance,  a speaker  strives  to  produce 
appropriate  respiratory,  laryngeal,  and  articulatory  activities. 
However,  he/she  is  unconcerned  about  the  details  of  the  resulting 
speech  signal  because  many  features  of  this  signal  are  not  critical 
to  vocal  communication  between  people.  Therefore  the  same  speaker 
rarely  utters  a given  word  twice  in  exactly  the  same  way,  even  when 
the  utterances  are  produced  in  succession.  This  effect  within  the 
speaker  is  called  intraspeaker  variability  [Hecker,  1971]. 

Since we use only the sustained vowel for this research, we give
some examples with vowel sounds. The speech and EGG waveforms for a
sustained vowel /i/ uttered by the same speaker, varying the pitch
period slightly, are shown in Figure 2-3. We can easily see the
variability between waveforms of the same vowel utterance for a
single speaker. The LPC spectra of the sustained vowel /i/ for six
different speakers are illustrated in Figure 2-4. These spectra
were obtained by the pitch synchronous LPC analysis for 10
consecutive pitch periods. The speech signal was preemphasized
before analysis. We can see that there exist significant differences
between spectra, despite the fact that all the speakers uttered the
same vowel /i/. Figure 2-5 shows the LPC spectra of the sustained
vowel /i/, recorded at different times, for the same speaker by
varying pitch period and tone very slightly. Pitch synchronous LPC
analysis was done for 10 consecutive pitch periods. Since we used
the same vowel for the same speaker, the formant structures are
very similar. However, we can see the variations between
different recording sessions.

Figure 2-2. Speech and EGG waveforms of the vowel /i/ produced by
six speakers (time scale: 50 msec)
Male speakers: AVB, DAS, GPM
Female speakers: BPV, CAP, SAS

Figure 2-3. Speech and EGG waveforms of the vowel /i/ produced by
a single speaker (BKS) (time scale: 50 msec)

The  success  of  any  method  of  speaker  recognition  depends  on  the 
degree  to  which  sampled  interspeaker  variability  is  greater  than  the 
sampled  intraspeaker  variability.  Because  speaker  variability  is  a 
reflection  of  many  differences  in  speech  production,  both  forms  of 
variability  are  extremely  difficult  to  quantify. 

Recognition  Techniques 

Speaker  recognition  is  an  example  of  a pattern  recognition 
task,  and  uses  standard  pattern  recognition  techniques.  In  essence, 
a speaker  recognition  system  requires  a mapping  between  speech  and 
speaker  identity  so  that  each  possible  input  speech  is  identified 
with  its  corresponding  speaker.  The  pattern  recognition  task  can  be 
divided into two parts: training and recognition. The training part
establishes reference patterns which are assigned speaker labels.
In the recognition part, appropriate decision rules are used either
to assign input test patterns to one of the speakers or to verify if
the input has originated from the speech of the claimed speaker.
Given an input utterance, the reference templates may be searched to
find an exact or close match using some similarity measure, and the
system output would provide the appropriate identity.

Figure 2-4. LPC spectra of the vowel /i/ obtained from six speakers
(10th order pitch synchronous analysis, Hamming window)
The spectra for 10 LPC vectors are superimposed for
each subject. Horizontal axis: frequency (kHz)
Male speakers: DMH, JRS, AAA
Female speakers: BEM, MBK, DVD

Figure 2-5. LPC spectra of the vowel /i/ obtained from a single
speaker (BKS). The spectra for 10 LPC vectors are
superimposed in each plot.
(10th order pitch synchronous analysis, Hamming window)

Applying  pattern  recognition  methods  to  speaker  recognition 
involves  several  steps  as  shown  in  Figure  2-6:  normalization, 
parameterization,  feature  extraction,  a similarity  comparison,  and 
decision [O'Shaughnessy, 1986]. The initial normalization step
attempts  to  eliminate  variability  in  the  input  speech  signal  due  to 
environmental  conditions  such  as  recording  levels  or  background 
noise.  The  simplest  form  of  normalization  adjusts  the  maximum 
signal  amplitude  to  a standard  level  to  account  for  variations  in 
recording  levels,  distance  from  the  microphone,  and  original  speech 
intensity.  Major  data  reduction  occurs  in  converting  the  signal 
into  parameters  and  features. 

A variety  of  parameters  can  be  extracted  from  the  speech  signal 
either  directly  or  after  spectral  transformation  in  the  frequency 
domain.  To  parameterize  the  speech  signal  efficiently,  a standard 
speech  model  is  used,  which  separates  the  excitation  from  the  vocal 
tract.  Excitation  is  typically  represented  in  terms  of  a voicing 
decision,  overall  amplitude,  and  estimation  of  the  fundamental 
frequency,  Fo,  during  voiced  speech.  Common  parameters  are  the  LPC 
coefficients, the cepstral coefficients, the channel energies in a
channel vocoder, or some form of reduced Fourier transform. They
[...] Procedures for selecting an efficient set of acoustic
parameters have been discussed extensively in the literature
[Wohlford, 1980].

Figure 2-6. Block diagram of the speaker recognition system
using pattern recognition method (input: speech; output:
recognized speaker)

The  major  part  of  the  recognition  process  is  to  measure  a 
similarity  between  templates  of  parameters  or  feature 
representations  of  a test  speech  signal  and  of  a reference  signal. 
In  a speaker  identification  system,  the  reference  most  closely 
matching  the  test  is  usually  chosen,  yielding  the  output  of  the 
speaker  identity  corresponding  to  that  reference.  However,  in  a 
speaker  verification  system,  the  decision  to  accept  or  reject  the 
claim  is  made  according  to  the  pre-assigned  threshold  level. 
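
The two decision rules can be contrasted in a short sketch. It assumes that each speaker is represented by a single reference vector and that the Euclidean distance serves as the (dis)similarity measure; both are simplifications chosen for illustration, and the function names and thresholds are hypothetical.

```python
import numpy as np

def identify_speaker(test_vector, references, reject_threshold=None):
    """Return the label of the closest reference, or None for "no match".

    `references` maps a speaker label to that speaker's reference vector.
    """
    test_vector = np.asarray(test_vector, dtype=float)
    distances = {
        label: float(np.linalg.norm(test_vector - np.asarray(ref, dtype=float)))
        for label, ref in references.items()
    }
    best = min(distances, key=distances.get)
    if reject_threshold is not None and distances[best] > reject_threshold:
        return None  # the test speaker is judged not to be among the N known
    return best

def verify_speaker(test_vector, claimed_reference, threshold):
    """Accept the identity claim if the distance is within the threshold."""
    difference = (np.asarray(test_vector, dtype=float)
                  - np.asarray(claimed_reference, dtype=float))
    return float(np.linalg.norm(difference)) <= threshold
```

Identification compares the test pattern against all N references, whereas verification compares it against only the reference of the claimed speaker, so the cost of verification does not grow with the size of the speaker population.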

The  template  representation  of  a small  portion  or  frame  of 
speech  using  M features  can  be  viewed  as  an  M-dimensional  vector. 
Then  the  similarity  between  two  templates  can  be  viewed  as  inversely 
proportional  to  their  separation  distance  in  M-dimensional  space. 
One  standard  measure  is  a quadratic  distance,  which  for  two  M- 
dimensional  templates  x and  y is  given  by  [Tou  and  Gonzales,  1974] 

d(x,y) = (x - y)' W^{-1} (x - y)    (2.1)

where W is a positive definite matrix which allows different
weighting  for  individual  features  of  the  template,  depending  on 
their  utility  in  the  feature  space.  The  common  Euclidean  distance 
sets  W to  be  the  identity  matrix  I,  whereas  the  general  Mahalanobis 
distance sets W to be the autocovariance matrix corresponding to the
reference vector [Tou and Gonzales, 1974]. A number of distortion
measures  have  been  introduced  which  appear  more  subjectively 
meaningful  than  the  standard  distance  measure  for  the  applications 
to  speech  signal.  These  approaches  typically  attempt  to  stabilize 
and  statistically  characterize  orthogonal  spectral  vector 
combinations,  cepstral  coefficients,  and  a variety  of  LPC-based 
parameters [Wohlford, 1980].
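
The quadratic distance of Equation (2.1) can be evaluated directly, as in the following sketch. The function name is hypothetical, and the weighting matrix W is assumed to be supplied by the caller (for example, a covariance matrix estimated from training data). Note that with W equal to the identity matrix the expression gives the squared Euclidean distance.

```python
import numpy as np

def quadratic_distance(x, y, W=None):
    """Evaluate d(x, y) = (x - y)' W^{-1} (x - y).

    With W omitted, W is taken to be the identity matrix.
    """
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    if W is None:
        return float(diff @ diff)
    # Solve W z = diff instead of forming the inverse of W explicitly.
    return float(diff @ np.linalg.solve(np.asarray(W, dtype=float), diff))
```

Solving the linear system rather than inverting W is a common numerical choice; it gives the same value when W is positive definite.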

In  discussing  distance  measures,  a comparison  of  stationary 
sounds  was  implicitly  assumed,  where  a single  feature  representation 
of  each  utterance  would  be  sufficient.  Most  attempts  at  speaker 
recognition,  however,  use  vocabularies  that  involve  sequences  of 
different  acoustic  events.  Since  automatic  segmentation  of 
utterances  into  meaningful  linguistic  units  such  as  phonemes  or 
syllables  is  difficult,  templates  are  usually  compared  on  a frame- 
by-frame  basis.  This  leads  to  alignment  problems.  Utterances  are 
generally  spoken  at  different  speaking  rates,  even  for  a single 
speaker  repeating  the  same  word.  Most  high-performance  speech  or 
speaker  recognition  systems  address  the  problem  of  alignment  by 
nonlinearly  warping  one  template  in  an  attempt  to  align  similar 
acoustic  segments  in  the  test  and  reference  templates.  The 
procedure,  called  dynamic  time  warping  (DTW),  combines  alignment  and 
distance  computation  through  a dynamic  programming  procedure  [Sakoe 
and Chiba, 1978].
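
The combination of alignment and distance computation in DTW can be sketched with a basic dynamic programming recursion. This is a simplified illustration: it uses the Euclidean distance between frames and an unconstrained symmetric step pattern, without the slope constraints or adjustment windows that practical systems usually add. The function name and the input layout (a sequence of frames, each a feature vector) are assumptions.

```python
import numpy as np

def dtw_distance(test, reference):
    """Accumulated distance along the best warping path between two
    sequences of feature vectors (arrays of shape (frames, features))."""
    test = np.asarray(test, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n, m = len(test), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(test[i - 1] - reference[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = local + min(cost[i - 1, j],
                                     cost[i, j - 1],
                                     cost[i - 1, j - 1])
    return float(cost[n, m])
```

Two utterances that contain the same sequence of frames at different speaking rates receive a small accumulated distance, which is the property that makes the procedure useful for text dependent comparison.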

Since  acoustic  cues  to  a speaker's  identity  are  spread 
throughout  each  of  his/her  utterances,  many  systems  utilize 
templates of averaged parameters rather than the full time sequences
of parameters used in speech recognition [O'Shaughnessy, 1986].
This  statistical  approach  is  most  useful  in  text  independent  cases, 
since  the  sequences  of  training  and  testing  utterances  do  not 
correspond.  Use  of  the  long-term-average  spectrum  as  a feature 
vector  was  discovered  to  have  potential  for  text  independent 
recognition  during  initial  exploratory  studies  of  text  independent 
recognition  using  the  spectral  pattern  matching  technique  [Wohlford, 
1980]. Unfortunately, the long-term spectrum is not a good and
stable feature vector to use for speaker recognition, since it is
sensitive to changes in the spectral response of any interposed
communication  channel.  Furthermore,  it  is  not  particularly  stable 
across  variations  in  the  speaker's  speech  effort  level  [Doddington, 
1985]. 

Another  approach  to  text  independent  speaker  recognition  is  to 
search  for  specific  phonetic  events  in  the  incoming  speech  signal 
and  then  to  compare  the  speech  features  belonging  to  the  matching 
phonetic  event  of  the  reference  speakers  [Doddington,  1985].  The 
problem  with  this  approach  is  that  errors  in  detecting  phonetic 
events  tend  to  corrupt  the  speaker  recognition  process.  Even  using 
speaker  specific  phonetic  references,  the  reliability  of  phonetic 
detection  is  currently  inadequate  to  support  good  speaker 
recognition  performance  [Pfeifer,  1978]. 

A new  technique,  vector  quantization  (VQ)  [Rabiner  et  al., 
1983;  Makhoul  et  al.,  1985;  Pan  et  al.,  1985;  Soong  et  al.,  1987; 
Soong  and  Rosenberg,  1988],  has  been  successfully  applied  to  both 
speech recognition and speaker recognition in similar ways and for
similar reasons, i.e., to avoid the problem of segmenting into
meaningful subunits. Vector quantization provides an alternative to
DTW when it is applied to speaker recognition. Vector
quantization is a coding technique typically used to lower the
transmission  rate.  The  key  issues  in  implementing  VQ  concern  the 
design  and  search  of  the  codebook.  Codebook  design  involves  the 
analysis  of  a large  training  sequence  of  speech  sufficiently  varied 
to  contain  examples  of  phonemes  in  many  different  contexts.  An 
iterative  design  procedure  is  used  to  converge  upon  a locally 
optimal  codebook  (optimal  in  the  sense  that  the  average  distortion 
measure  is  minimized  across  the  training  set). 

A separate  VQ  codebook  is  usually  designed  for  each  combination 
of  speaker  and  vocabulary  word,  based  on  one  or  more  utterances  of 
the  word.  Each  test  template  is  evaluated  by  all  codebooks.  The 
speaker corresponding to the codebook yielding the lowest distortion
measure is selected as the recognizer output. For speaker
verification, the distortion between the test template and the
codebook of the claimed speaker is compared to a threshold. VQ is
computationally efficient compared with storing and comparing a
large amount of template data in the form of individual spectra.
Therefore, VQ can be useful for text dependent as well as text
independent speaker recognition.
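The identification and verification rules described above can be sketched as follows. This is only an illustration under stated assumptions: the squared Euclidean distance stands in for whatever distortion measure is actually used, and the function names, speaker labels, codebooks, and threshold are all hypothetical.

```python
import numpy as np

def average_distortion(frames, codebook):
    """Mean distortion between each frame and its nearest codeword."""
    frames = np.asarray(frames, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # d[t, i] = squared Euclidean distance between frame t and codeword i.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

def identify(frames, codebooks):
    """Identification: pick the speaker whose codebook gives the lowest distortion."""
    scores = {name: average_distortion(frames, cb) for name, cb in codebooks.items()}
    return min(scores, key=scores.get), scores

def verify(frames, claimed_codebook, threshold):
    """Verification: accept the claim if the distortion is below a threshold."""
    return average_distortion(frames, claimed_codebook) <= threshold

# Hypothetical two-speaker example with 2-dimensional feature vectors.
codebooks = {
    "speaker_a": np.array([[0.0, 0.0], [1.0, 1.0]]),
    "speaker_b": np.array([[5.0, 5.0], [6.0, 6.0]]),
}
test_frames = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 1.0]])
best, scores = identify(test_frames, codebooks)
```

Only the codebooks need to be stored per speaker, which is the source of the computational saving relative to comparing full templates.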

Research  Rationale 

We  have,  so  far,  reviewed  detection  methods  of  laryngeal 
dysfunction and some techniques of speaker recognition. Our
hypothesis for years has been that the vocal fold vibratory pattern
is perhaps the major factor relating the laryngeal function to the
sound produced [Childers et al., 1982 and 1984; Childers and
Krishnamurthy, 1985; Moore et al., 1985; Alsaka, 1987]. Thus, a
laryngeal pathology becomes acoustically perceptible when the
dysfunction is reflected in the vocal fold vibratory pattern.

Applications  of  linear  prediction  to  the  speech  signal  showed 
that  it  could  be  utilized  to  effectively  classify  speakers  having  a 
normal  or  abnormal  larynx  [Davis,  1976;  Deller  and  Anderson,  1980; 
Smith,  1980].  Since  the  LPC  provides  a good  parametric 
representation  of  the  speech  waveform,  it  can  be  considered  to 
reflect  changes  in  the  speech  signal  due  to  the  laryngeal 
dysfunction.  Therefore,  to  capture  the  speech  waveform  perturbation 
in  the  frequency  domain,  we  investigated  the  LPC  spectrum  variations 
of  the  given  speech  signal. 

Perturbation  analysis  of  the  fundamental  frequency  and 
amplitude  from  the  EGG  signal  has  been  found  to  be  useful  to  assess 
the  clinical  significance  of  voice  disorders  [Haji  et  al.,  1986]. 
To  obtain  more  detailed  information  about  the  vibratory  pattern  of 
the  vocal  folds  from  the  EGG  signal,  we  analyzed  the  EGG  waveform 
extensively  by  measuring  time  intervals  and  amplitude  differences  at 
various  points  per  cycle  for  some  fixed  time  duration.  Then  we 
examined  these  values  to  derive  some  useful  parameters  for  the 
laryngeal  pathology  detection. 

Speaker  identification  using  a sustained  vowel  phonation  for 
training and testing utterances can have many advantages compared to
speaker identification using general texts. First, the sustained
vowel phonation is stable over time. Since it will be affected
little by the speaker's speaking style or speaking rate, we may
need neither time alignment nor dynamic time warping for the analysis. It
is  also  easy  to  extract  features  from  sustained  vowels  because  we  do 
not  need  exact  detection  of  silence,  utterance  beginning  and  ending 
points,  etc. 

Soong  et  al.  [1987]  proposed  a VQ  approach  to  speaker 
recognition  using  the  LPC  vectors  for  isolated  digits  as  feature 
vectors.  We  investigated  the  intraspeaker  and  interspeaker 
variabilities  of  the  pitch  synchronous  LPC  spectral  distortion  for 
the  sustained  vowel  phonation.  Then  we  proposed  a speaker 
identification  scheme  using  the  speaker-based  VQ  codebook  of  the 
sustained  vowel  phonation. 


CHAPTER  THREE 


LINEAR PREDICTION AND VECTOR QUANTIZATION

Linear Prediction

Introduction 

With  the  advent  of  electronic  digital  computers,  computational 
modeling  techniques  for  the  time  series  analysis  have  been  developed 
in  various  fields  such  as  statistics,  control  theory,  and 
communications.  In  time  series  analysis,  each  continuous  time 
signal  s(t)  is  sampled  to  obtain  a discrete  time  signal  s(nT),  also 
known  as  a time  series,  where  n is  an  integer  variable  and  T is  the 
sampling interval. The sampling frequency is then $f_s = 1/T$.
Henceforth, assuming the sampling interval is normalized to T = 1, we
will abbreviate s(nT) by s(n) with no loss of generality.

A major  concern  for  this  discrete  time  signal  sequence  is  that 
of  system  modeling.  It  is  clear  that  if  one  is  successful  in 
developing  a parametric  model  for  the  behavior  of  some  signals,  then 
the  model  can  be  used  for  different  applications,  such  as 
prediction,  data  compression,  and  spectral  estimation  with  high 
resolution.  In  many  applications,  the  underlying  physical 
environment  generating  the  signal  can  be  well  modeled  by  a linear 
rational  system  of  low  order.  In  speech,  for  instance,  it  is  known 
that  a good  model  for  the  speech  generating  mechanism  is  an  all-pole 
linear system [Markel and Gray, 1976].

In general, a linear rational system has an input-output
relationship described by a linear difference equation:

$$ s(n) = \sum_{k=1}^{p} a_k \, s(n-k) + G \sum_{i=0}^{q} b_i \, u(n-i), \qquad b_0 = 1 \tag{3.1} $$

where $a_k$, $1 \le k \le p$, $b_i$, $1 \le i \le q$, and the gain $G$ are the parameters of the
hypothesized  system,  and  u(n)  and  s(n)  are  the  input  and  output 
signals  of  the  system,  respectively.  Equation  (3.1)  says  that  the 
output  s(n)  is  predictable  from  linear  combinations  of  past  outputs 
and  present  and  past  inputs.  This  model  is  known  as  the  pole-zero 
model.  The  transfer  function  of  this  system  is 


$$ H(z) = \frac{S(z)}{U(z)} = G \, \frac{1 + \sum_{i=1}^{q} b_i z^{-i}}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{3.2} $$


where S(z) and U(z) are the z-transforms of s(n) and u(n), respectively.
The  roots  of  the  numerator  and  denominator  polynomials  are  the  zeros 
and  the  poles  of  the  model,  respectively. 
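As an illustration of the difference equation (3.1), the following sketch computes the output sample by sample, assuming zero initial conditions. The function name `pole_zero_output` and the example coefficients are hypothetical and chosen only to make the impulse responses easy to check by hand.

```python
import numpy as np

def pole_zero_output(u, a, b, gain=1.0):
    """s(n) = sum_k a[k-1] s(n-k) + gain * sum_i b[i] u(n-i), with b[0] = 1."""
    u = np.asarray(u, dtype=float)
    s = np.zeros(len(u))
    for n in range(len(u)):
        # Weighted past outputs (the poles of the model).
        past = sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        # Weighted present and past inputs (the zeros of the model).
        inputs = sum(b[i] * u[n - i] for i in range(len(b)) if n - i >= 0)
        s[n] = past + gain * inputs
    return s

impulse = np.zeros(5)
impulse[0] = 1.0
# One pole at z = 0.5: the impulse response decays geometrically.
all_pole = pole_zero_output(impulse, a=[0.5], b=[1.0])
# No poles, one zero: the impulse response has finite length.
all_zero = pole_zero_output(impulse, a=[], b=[1.0, 0.5])
```

The first call gives 1, 0.5, 0.25, ... and the second gives 1, 0.5, 0, ..., which shows the infinite versus finite memory of the recursive and non-recursive parts.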

There  are  two  special  cases  of  the  model  that  are  of  interest: 

1) all-zero model: $a_k = 0$, $1 \le k \le p$

2) all-pole model: $b_i = 0$, $1 \le i \le q$

The  all-zero  model  is  known  as  the  moving  average  (MA)  model,  and 
the  all-pole  model  is  known  as  the  autoregressive  (AR)  model 
[Makhoul,  1975].  The  pole-zero  model  is  then  known  as  the 
autoregressive  moving  average  (ARMA)  model.  Hereafter,  we  will 
focus  entirely  on  the  all-pole  modeling  technique  for  the  speech 
signal, which is commonly referred to as linear predictive coding
(LPC).

Model of Voiced Speech Production

Speech  sounds  are  produced  as  a result  of  acoustical  excitation 
of  the  human  vocal  tract.  During  the  production  of  voiced  sounds, 
the  vocal  tract  is  excited  by  a series  of  quasi-periodic  pulses.  In 
the  case  of  unvoiced  sounds,  the  excitation  is  provided  by  air 
passing  through  a constriction  in  the  tract.  A simple  model  of  the 
vocal  tract  can  be  made  by  representing  it  as  a discrete  time 
varying  linear  filter.  The  underlying  assumption  in  most  speech 
processing  schemes  is  that  the  properties  of  the  speech  signal 
change  relatively  slowly  with  time.  Under  this  assumption  it  is 
possible  to  define  a transfer  function  in  the  complex  z-plane  for 
the  time  varying  filter,  or  the  vocal  tract. 

As  mentioned  earlier,  it  is  well  known  that  a good  model  for  a 
speech  generating  mechanism  is  an  all-pole  linear  system  driven  by 
an  impulse  train  or  white  noise.  The  block  diagram  of  a functional 
model  of  speech  production  based  on  the  linear  prediction 
representation  of  the  speech  waveform  is  shown  in  Figure  3-1.  The 
impulse  train  generator  provides  a train  of  unit  samples  whose 
spacing  is  T,  the  pitch  period.  The  multiplier,  G,  is  a gain 
control  that  varies  with  time  and  controls  the  intensity  of  the 
output.  In  this  model,  the  composite  spectral  effects  of  radiation, 
vocal  tract,  and  glottal  excitation  are  represented  by  a time 
varying  digital  filter  (all-pole)  whose  steady  state  system  function 
is  given  by 


$$ H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \tag{3.3} $$

Figure 3-1. Block diagram of a functional model for speech production

In our study, we will consider only the model of a voiced
speech  signal  because  we  use  a sustained  vowel  phonation  for  the 
detection  of  laryngeal  pathology  and  for  speaker  identification. 
During  the  sustained  vowel  phonation,  the  vocal  tract  shape  changes 
very  little.  The  linear  model  of  voiced  speech  production  is  shown 
in  Figure  3-2.  One  major  assumption  of  this  model  is  the 
separability  of  the  filter  segments  during  the  generation  of  speech, 
i.e.,  there  is  no  source-tract  interaction.  In  this  figure,  the 
samples  of  the  speech  waveform  are  viewed  as  the  output  of  the 
discrete time system. The glottal waveform is the output of the
glottal shaping filter, G(z), when it is driven by the impulse train,
u(n). A glottal shaping filter, represented by several all-pole
models  based  on  the  physiologically  observed  waveform  [Flanagan, 
1972],  is  required  to  provide  the  appropriate  spectral  coloration. 
The  vocal  tract  filter,  V(z),  is  used  to  model  the  acoustical 
resonance  characteristics  between  the  glottis  and  the  lips.  It 
could  be  considered  as  a linear  quasi-time-invariant  discrete  system 
which  consists  of  a cascade  connection  of  digital  resonators  having 
different  frequencies,  which  are  also  called  formants.  The  relation 
between  the  volume  velocity  waveform  at  the  lips  and  the  sound 
pressure  waveform  in  free  field  is  modeled  by  R(z),  a lip  radiation 
filter.  Its  transfer  function  could  be  simplified  as  the  scaled 
derivative  of  the  sound  pressure  waveform  with  respect  to  the  lip 
volume  velocity  waveform  [Flanagan,  1972].  The  overall  linear  model 
of speech production for the acoustical sound pressure transform,
S(z), is the product of the z-transforms of the lip radiation filter,
the vocal tract filter, the glottal shaping filter, and the source
excitation, which is given as

$$ S(z) = G(z) \, V(z) \, R(z) \, U(z) \tag{3.4} $$

Determination  of  the  Predictor  Coefficients 

The techniques and methods of linear prediction have been
available in the engineering literature for a long time [Makhoul,
1975]. As applied to speech processing, the term linear prediction
refers to a variety of essentially equivalent formulations of the
problem of modeling the speech waveform. The various, often
equivalent, formulations of linear prediction analysis applied to
speech have been [Itakura and Saito, 1968; Wakita, 1973; Chandra
and Lin, 1974; Markel and Gray, 1976]:

. the  covariance  method 
. the  autocorrelation  method 
. the  lattice  method 
. the  inverse  filter  formulation 
. the  spectral  estimation  formulation 
. the  maximum  likelihood  formulation 
. the  inner  product  formulation 

In  this  section,  we  will  examine  in  detail  only  the  first  two  basic 
methods  of  analysis  listed  above.  All  the  other  formulations  are 
equivalent  to  one  of  the  first  three  methods. 

For  the  system  of  Figure  3-2,  the  speech  samples  s(n)  are 
related  to  the  excitation  u(n)  by  the  simple  difference  equation 
given  by 

$$ s(n) = \sum_{k=1}^{p} a_k \, s(n-k) + G \, u(n) \tag{3.5} $$

Its transfer function becomes an all-pole system as in (3.3). Given
a particular  signal  s(n),  we  have  to  determine  the  predictor 
coefficients  and  the  gain  G. 

Here  we  assume  that  the  input  u(n)  is  typically  unknown.  Then 
the  signal  s(n)  can  be  predicted  only  approximately  from  a linearly 
weighted summation of the past p samples. Let this approximation of
s(n), or the predicted value, be $\hat{s}(n)$, where

$$ \hat{s}(n) = \sum_{k=1}^{p} a_k \, s(n-k) \tag{3.6} $$

The error between the actual value s(n) and the predicted value $\hat{s}(n)$
may then be defined as

$$ e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k \, s(n-k) \tag{3.7} $$

where e(n) is also known as the residual signal. Since e(n)
represents the error between the actual sample and the predicted
one, it would seem reasonable to choose the coefficients $a_k$, $1 \le k \le p$, so
that e(n) is minimized in some manner. With the method of least
squares, the parameters $a_k$, $1 \le k \le p$, are obtained as a result of the
minimization of the mean or total squared error with respect to each
of the parameters $a_k$.

The  main  reason  for  this  choice  of  optimization  criterion  is 
simply  that  the  resulting  equations  are  linear,  tractable,  and  they 
produce  excellent  results  in  the  analysis  of  speech.  If  we  denote 
the  total  squared  error  by  E,  then 

$$ \begin{aligned}
E &= \sum_{n=n_0}^{n_1} e^2(n) = \sum_{n=n_0}^{n_1} \Big[ s(n) - \sum_{k=1}^{p} a_k \, s(n-k) \Big]^2 \\
  &= \sum_{n=n_0}^{n_1} \sum_{i=0}^{p} \sum_{j=0}^{p} a_i \, s(n-i) \, s(n-j) \, a_j, \qquad a_0 = -1
\end{aligned} \tag{3.8} $$

where $n_0$ and $n_1$ define the index limits over which error
minimization occurs. If we define

$$ c_{ij} = \sum_{n=n_0}^{n_1} s(n-i) \, s(n-j) \tag{3.9} $$

then the total error E can be equivalently written as

$$ E = \sum_{i=0}^{p} \sum_{j=0}^{p} a_i \, c_{ij} \, a_j \tag{3.10} $$

Equation (3.10) shows that the total squared error E is in quadratic
form. Minimization of E is then obtained by setting the partial
derivative of E with respect to $a_k$, $1 \le k \le p$, to zero and then solving
the resulting equations. From (3.10), we get

$$ \frac{\partial E}{\partial a_k} = 0 = 2 \sum_{i=0}^{p} a_i \, c_{ik}, \qquad 1 \le k \le p \tag{3.11} $$

or, since $a_0 = -1$,

$$ \sum_{i=1}^{p} a_i \, c_{ik} = c_{0k}, \qquad 1 \le k \le p \tag{3.12} $$


The p unknown predictor coefficients $a_i$, $1 \le i \le p$, are obtained by
solving this set of p linear simultaneous equations. The known
parameters $c_{ik}$, $0 \le i \le p$, $0 \le k \le p$, are determined from the given speech
data by (3.9), which shows that the samples from $s(n_0 - p)$ to $s(n_1)$
are required.

We  shall  now  specify  the  range  of  summation  over  n in  (3.9). 
There are two cases of interest, which will lead to two distinct
methods for the estimation of the parameters. These are referred to
as the autocorrelation method and the covariance method. The
autocorrelation method is defined by setting $n_0 = -\infty$ and $n_1 = \infty$. That
is, we assume the error in (3.8) is minimized over the infinite duration
$-\infty < n < \infty$. In practice, however, the signal s(n) is known only within
a finite interval, or we are only interested in the signal
over a finite interval. One particular method is to multiply the
signal s(n) with a window function w(n) to obtain another signal
that is zero outside some interval $0 \le n \le N-1$. Detailed discussions of
the effects of windows and their design can be found in Oppenheim
and Schafer [1975] and Childers and Durling [1975].

The simplest form of w(n) is the rectangular window, having $w(n) = 1$,
$0 \le n \le N-1$. This constraint allows $c_{ij}$ to be simplified as

$$ \begin{aligned}
c_{ij} &= \sum_{n=-\infty}^{\infty} s(n-i) \, s(n-j) \\
       &= \sum_{n=-\infty}^{\infty} s(n) \, s(n+|i-j|) \\
       &= \sum_{n=0}^{N-1-|i-j|} s(n) \, s(n+|i-j|) \\
       &= r(|i-j|)
\end{aligned} \tag{3.13} $$

Equations (3.12) and (3.13) then reduce to

$$ \sum_{k=1}^{p} a_k \, r(i-k) = r(i), \qquad 1 \le i \le p \tag{3.14} $$

where

$$ r(i) = \sum_{n=0}^{N-1-i} s(n) \, s(n+i), \qquad i \ge 0 \tag{3.15} $$

Since the coefficients r(i-k) form what often is known as an
autocorrelation matrix, we call this method the autocorrelation
method. An autocorrelation matrix is a Toeplitz matrix, i.e., it is
symmetric and all the elements along a given diagonal are equal
[Makhoul, 1975]. For the same sequence of speech samples s(n),
$0 \le n \le N-1$, the covariance method is defined by setting $n_0 = p$ and $n_1 = N-1$
so that the error is minimized over the finite interval [Atal and
Hanauer, 1971; Markel and Gray, 1976], say, $p \le n \le N-1$, and all N
speech samples are used in calculating the correlation terms, $c_{ij}$,
given  in  (3.9).  The  covariance  method  reduces  to  the 
autocorrelation  method  as  the  interval  over  which  n varies  goes  to 
infinity. 
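The two methods can be compared with a short numerical sketch. It assumes a rectangular window and solves the normal equations with a general linear solver rather than a dedicated recursion such as Levinson-Durbin; the function names and the synthetic second-order test signal are hypothetical.

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Solve sum_k a_k r(|i-k|) = r(i), 1 <= i <= order, as in (3.14)."""
    s = np.asarray(frame, dtype=float)
    n = len(s)
    # r(i) = sum_{m=0}^{N-1-i} s(m) s(m+i), as in (3.15).
    r = np.array([np.dot(s[: n - i], s[i:]) for i in range(order + 1)])
    # Toeplitz autocorrelation matrix.
    R = np.array([[r[abs(i - k)] for k in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def lpc_covariance(frame, order):
    """Solve sum_i a_i c_ik = c_0k, 1 <= k <= order, with n0 = order, n1 = N-1."""
    s = np.asarray(frame, dtype=float)
    n = len(s)
    # lagged[i] holds s(m - i) for m = order, ..., N-1.
    lagged = [s[order - i : n - i] for i in range(order + 1)]
    c = np.array([[np.dot(lagged[i], lagged[j]) for j in range(order + 1)]
                  for i in range(order + 1)])
    return np.linalg.solve(c[1:, 1:], c[0, 1:])

# Synthetic all-pole signal: impulse response of s(n) = 1.3 s(n-1) - 0.4 s(n-2) + u(n).
true_a = np.array([1.3, -0.4])
s = np.zeros(200)
for n in range(200):
    s[n] = 1.0 if n == 0 else 0.0
    for k in (1, 2):
        if n - k >= 0:
            s[n] += true_a[k - 1] * s[n - k]

a_auto = lpc_autocorrelation(s, 2)
a_cov = lpc_covariance(s, 2)
```

For this idealized, fully decayed impulse response both methods recover the generating coefficients; on short frames of real speech the two estimates generally differ.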


Factors to be Considered

Choice of analysis interval. The choice of the analysis interval
includes two factors, the location of the interval (placement with
respect to the pitch period) and the length of the
interval.  It  is  desirable  to  perform  spectral  analysis  within  an 
interval  where  vocal  tract  movement  is  negligible,  on  the  order  of 
15 - 20 msec for most vowels. Thus, a reasonable number of samples
N, used in the analysis interval, is given by $f_s$, the sampling frequency
in kHz, multiplied by 15 - 20 msec. Arbitrary placement of a 15 -
20 msec interval will not substantially affect the results of either
the covariance method or the autocorrelation method in most
instances, and generally the covariance method gives a smaller prediction
error than the autocorrelation method [Chandra and Lin, 1974].
However, if the covariance method is applied to voiced speech with
a much smaller time interval, then the placement of the interval
becomes critical. Localized placement within a single pitch period is
referred to as pitch synchronous analysis. Arbitrary placement of
the time interval is referred to as pitch asynchronous analysis.

Preemphasis. When the vocal tract characteristics, without the
effects  of  glottal  waveform  and  lip  radiation  characteristics,  are 
desired,  the  speech  signal  is  preemphasized  before  analysis  [Markel 
and  Gray,  1976].  The  basis  for  preemphasizing  the  speech  signal  to 
estimate  the  spectral  properties  of  the  vocal  tract  only  is  the 
experimental  results  showing  that  the  vocal  tract  area  functions 
estimated  from  the  speech  directly  have  no  apparent  relation  to 
actual  vocal  tract  behavior,  whereas  with  preemphasis  the  results 
appear  quite  reasonable  [Wakita,  1972]. 

A simple way to effect this preemphasis is to pass the signal
through a simple one-zero filter of the form $1 - \mu z^{-1}$, where $\mu$ is
near or equal to one. The one-zero filter for preemphasis can be
explained with the assumptions that the glottal waveform is modeled
as a two-pole filter and the lip radiation characteristic is modeled
as a one-zero filter. Thus, if we cancel the glottal shaping filter
G(z) and the lip radiation filter R(z), one pole still remains. To
eliminate this pole, we pass the speech signal through a one-zero
filter. For speech analysis the value of $\mu$ is not critical, and
values in the range from 0.9 to 1.0 yield roughly equivalent
results.
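In the time domain, the one-zero preemphasis filter described above is a first-order difference. The sketch below assumes the first output sample is passed through unchanged (zero initial condition); the function name `preemphasize`, the parameter name `mu`, and its default value of 0.95 are illustrative choices, not values taken from this study.

```python
import numpy as np

def preemphasize(signal, mu=0.95):
    """Apply y(n) = s(n) - mu * s(n-1), i.e. the one-zero filter 1 - mu z^-1."""
    s = np.asarray(signal, dtype=float)
    y = s.copy()
    # Subtract the scaled previous sample from every sample but the first.
    y[1:] -= mu * s[:-1]
    return y

# A constant (zero-frequency) input is almost completely removed.
flat = preemphasize([1.0, 1.0, 1.0, 1.0], mu=1.0)
mild = preemphasize([1.0, 1.0, 1.0, 1.0], mu=0.9)
```

With the coefficient equal to one, the constant input is cancelled after the first sample; with 0.9 a residue of 0.1 remains, which illustrates why values in this range behave similarly.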

Determination of prediction order. As a practical matter, it
is  generally  desirable  to  use  the  minimum  number  of  parameters 
necessary  to  accurately  model  the  significant  features  of  the 
signal.  In  the  modeling  of  speech,  these  features  are  the  vocal 
tract  resonances  or  formants  in  the  frequency  range  of  interest,  the 
radiation  effect,  and  the  glottal  characteristics.  Atal  and  Hanauer 
[1971]  demonstrated  that  in  order  to  adequately  represent  the  vocal 
tract  transfer  function,  the  linear  predictor  memory  must  be  equal 
to  twice  the  time  required  for  the  sound  waves  to  travel  from  the 
glottis  to  the  lips.  For  example,  for  an  average  17  cm  vocal  tract 
length  and  a velocity  of  sound  of  340  m/sec,  the  memory  should  be 
roughly  1 msec.  When  the  sampling  rate  is  10  kHz,  the  filter  order 
p must  be  at  least  ten.  As  the  glottal  filter  and  lip  radiation 
characteristics  have  not  been  accounted  for  in  the  above  model, 
these  numbers  could  be  applied  to  the  preemphasized  speech  signal. 
They  must  be  taken  as  lower  limits  for  the  speech  signal  with  no 
preemphasis.
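The arithmetic behind these numbers can be checked with a few lines of code. The helper name `min_predictor_order` is hypothetical, and the small tolerance inside the ceiling is only there to guard against floating-point round-off.

```python
import math

def min_predictor_order(tract_length_m, sound_speed_m_s, fs_hz):
    """Predictor memory (seconds) and the matching minimum order in samples."""
    # Twice the glottis-to-lips travel time.
    memory_s = 2.0 * tract_length_m / sound_speed_m_s
    # Number of samples needed to span that memory at the sampling rate.
    order = math.ceil(memory_s * fs_hz - 1e-9)
    return memory_s, order

# 17 cm vocal tract, 340 m/sec velocity of sound, 10 kHz sampling rate.
memory_s, order = min_predictor_order(0.17, 340.0, 10_000.0)

# Rule of thumb: sampling frequency in kHz plus two to five terms.
fs_khz = 10
rule_of_thumb = (fs_khz + 2, fs_khz + 5)
```

For the values in the text this gives a memory of 1 msec and a minimum order of ten, while the rule of thumb gives a range of 12 to 15 at 10 kHz.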

A series  of  synthesis  experiments  performed  by  Atal  and  Hanauer 
[1971]  indicated  that  in  most  cases  p equal  to  twelve  is  adequate 
for voiced speech when the sampling rate is 10 kHz. Markel [1972]
suggested, based on experimental results, that a reasonable value for
formant trajectory estimation is $f_s$ (the sampling frequency in kHz)
plus four or five. In summary, it is thought to be reasonable to
choose  p for  voiced  speech  analysis  as  the  sampling  frequency  in  kHz 
with  the  addition  of  several  more  terms  (from  two  to  five  depending 
upon the desired results) [Markel and Gray, 1976].

A more objective criterion for the determination of the filter
order is available through the use of the Akaike Final Prediction
Error (FPE) criterion [Akaike, 1970; Ulrych and Bishop, 1975].
Smith  [1980]  examined  this  FPE  versus  LPC  filter  order  for  the 
sustained  vowel  sound  /i/  obtained  from  speakers  both  with  normal 
larynges  and  with  diagnosed  laryngeal  pathologies.  On  the  basis  of 
these  considerations,  he  selected  the  12th  order  LPC  model,  with  no 
preemphasis, for the analysis of the speech signal.

In  the  next  section,  we  will  explain  the  VQ  technique,  which  we 
shall  use  both  for  the  quantitative  measurement  of  spectral 
variation  and  for  speaker  identification. 

Vector  Quantization 

Introduction 

Data  compression  is  the  conversion  of  a stream  of  analog  or 
very  high  rate  discrete  data  into  a stream  of  relatively  low  rate 
data  for  a communication  channel  or  storage  in  a digital  memory. 
The  conversion  of  an  analog  signal  into  a digital  signal  consists  of 
two  parts:  sampling  and  quantization.  Sampling  converts  a 
continuous  time  signal  into  a discrete  time  signal  at  regular 
intervals (sampling interval) of time. Quantization converts a
continuous-amplitude signal into one of a set of discrete amplitudes,
thus resulting in a discrete-amplitude signal that differs from
the continuous-amplitude signal by the quantization error or noise.

When  each  of  a set  of  parameters  (or  a sequence  of  signal 
values) is quantized separately, the process is known as scalar
quantization. When the set of parameters is quantized as a single
vector,  the  process  is  known  as  vector  quantization  (VQ).  In  other 
words,  vector  quantization  is  a process  of  mapping  a sequence  of 
discrete  vectors  into  a digital  sequence  suitable  for  communication 
over  a digital  channel  or  storage  in  a digital  medium. 

During  the  past  decade,  a number  of  design  algorithms  have  been 
developed  for  a variety  of  vector  quantizers  and  the  performance  of 
these  coders  has  been  studied  for  speech  waveforms,  speech  LPC 
vectors, images, and several simulated random processes. For the
purpose of speech coding, VQ was used in the 1950's by Dudley
[1958]. However, it was not until the introduction of LPC to speech
coding that VQ saw significant activity, spurred on mainly
by the work of Buzo et al. [1980] and Linde et al. [1980].
Because  of  the  relatively  large  computational  and  storage  costs  of 
VQ,  the  major  benefits  of  VQ  in  speech  coding  are  realized  largely 
at  a transmission  rate  of  1 bit  per  parameter  or  less,  which  is 
exactly  the  range  where  the  performance  of  scalar  quantizers 
degrades  sharply  [Makhoul  et  al.,  1985].  Today  very-low  rate  coding 
of  speech  remains  one  of  the  major  successful  applications  of  VQ  in 
speech  coding. 

The vector quantization technique has also been used effectively in
pattern  recognition  types  of  speech  applications,  such  as  in  speech 
recognition  [Furui,  1988]  and  speaker  recognition  [Pan  et  al.,  1985; 
Soong  et  al.,  1987;  Soong  and  Rosenberg,  1988].  That  is,  the  VQ 
problem  is  part  of  the  general  pattern  recognition  problem  of  the 
classification of data into a discrete number of categories that
optimize some fidelity criterion. What makes VQ distinctive from
other  classical  pattern  matching  approaches  is  the  mathematical 
foundation  of  the  distortion  measure  and  its  minimization  procedure. 

The  basic  concept  of  VQ  applied  to  speech  compression  is 
schematically depicted in Figure 3-3. In this figure,
$x_n = (a_1, a_2, \ldots, a_N)'$ represents an N-dimensional column vector whose
components $\{a_k, 1 \le k \le N\}$ are real-valued random variables having
continuous amplitude. The vector input $x_n$ is then mapped onto
another real-valued discrete N-dimensional vector y. Typically, y
takes one of a finite set of values, $S = \{y_i, 1 \le i \le L\}$. The set S, or
the collection of possible reproduction vectors, is called the
reproduction book or, simply, the codebook of the quantizer; L is
the size of the codebook, and its members are called codewords. The
size L of the codebook is also called the number of levels, a term
borrowed from scalar quantization terminology. To design such a
codebook, we need to partition the N-dimensional space of the input
vector into L regions or cells $\{C_i, 1 \le i \le L\}$ using a large number of
training vectors. Thus the process of codebook design is also known
as training the codebook.
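Codebook training can be illustrated with a generalized Lloyd iteration, which alternates between partitioning the training vectors into cells and replacing each codeword by the centroid of its cell. This sketch makes several assumptions that are not taken from the text: squared Euclidean distortion, a fixed number of iterations, and an initial codebook of evenly spaced training vectors instead of a splitting procedure. The function name `train_codebook` is hypothetical.

```python
import numpy as np

def train_codebook(training, size, iterations=20):
    """Return (codebook, average distortion over the training vectors)."""
    x = np.asarray(training, dtype=float)
    # Assumed initialization: evenly spaced training vectors.
    start = np.linspace(0, len(x) - 1, size).astype(int)
    codebook = x[start].copy()
    for _ in range(iterations):
        # Partition step: assign each training vector to its nearest codeword.
        d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        cell = d.argmin(axis=1)
        # Centroid step: move each codeword to the mean of its cell.
        for i in range(size):
            members = x[cell == i]
            if len(members) > 0:
                codebook[i] = members.mean(axis=0)
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook, d.min(axis=1).mean()

# Two well-separated groups of 2-dimensional training vectors.
training = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                     [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
codebook, avg_distortion = train_codebook(training, size=2)
```

The returned average distortion is the quantity that measures how tightly the training vectors cluster around the codewords; a procedure of this kind converges only to a locally optimal codebook.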

The  basic  operation  of  VQ  is  as  follows.  When  the  input  vector 
xn  comes  in,  the  distortion  (dissimilarity)  between  the  input  vector 
and  each  stored  codeword  is  computed.  The  encoded  output  is  then 
the  binary  representation  of  the  index  of  the  minimum  distortion 
codeword.  Since  we  represent  the  N-dimensional  input  vector  with 
simply  the  index  of  code  vector,  considerable  data  reduction  is 
achieved. The input vectors may be in the form of sampled data, FFT
coefficients, autocorrelation terms or their transformations. Here
we choose LPC parameters as input/training vectors because they are
an easily obtainable feature vector representing spectral
characteristics of utterances and are very suitable for vector
quantization.

Figure 3-3. Schematic diagram of the vector quantizer for speech
compression

In  this  study,  application  of  VQ  for  data  compression  is  not 
our  main  concern.  We  will  make  use  of  the  VQ  technique  in  two  ways. 
One  is  to  compute  the  average  distortion  of  all  the  vectors  used  for 
training  the  codebook  to  measure  the  spectral  variations  among  the 
given  vectors.  This  quantitative  measurement  will  be  examined  for 
the  detection  of  a laryngeal  pathology.  The  other  is  to  generate 
the  speaker-based  codebook  for  speaker  identification.  The  primary 
advantage  of  VQ,  in  this  case,  lies  in  the  codebook  approach  to 
characterization  of  the  speaker's  specific  features  and 
determination  of  the  similarity  between  the  test  input  and  reference 
utterances.

Distortion  Measures 

A distortion  measure  d is  an  assignment  of  a nonnegative  cost 
d(x,y)  of  reproducing  any  input  vector  x as  a reproduction  vector  y. 
Given  such  a distortion  measure,  we  can  quantify  the  performance  of 
a system  by  an  average  distortion  between  the  input  and  the  final 
reproduction.  To  be  useful,  a distortion  measure  should  be 
tractable  and  computable  to  permit  analysis,  and  subjectively 
meaningful  so  that  large  or  small  quantitative  measures  correlate 
with  bad  or  good  subjective  quality.  Therefore,  the  choice  of  the 
distortion  measure  is  a key  component  in  the  VQ  technique.  There 
are many such distortion measures which are useful in speech
analysis  [Nocerino  et  al.,  1985]. 

Of interest here is the Itakura-Saito distortion measure, which
arises in speech compression systems. This distortion measure is
based  on  the  "error  matching  measure"  developed  in  the  pioneering 
work  of  Itakura  and  Saito  [1968]  on  the  LPC  approach  to  voice 
coding.  Therefore,  it  is  very  useful  in  voice  coding  applications 
when  the  speech  signal  is  encoded  with  LPC  parameters.  We  will  not 
deal  with  the  details  of  the  Itakura-Saito  distortion  measure  since 
they  may  be  found  in  [Gray  et  al.,  1980]  and  [Linde  et  al.,  1980]. 
For  simplicity  in  computation,  a modified  form  of  this  distortion 
measure is usually used in VQ. Between two LPC vectors x and y, it
is given by

$$ d(x,y) = (x-y)'\, R_x\, (x-y) \tag{3.16} $$

where

$$ R_x = \{\, r(i-k)/r(0),\ 0 \le i,k \le N-1 \,\} \tag{3.17} $$

is the autocorrelation matrix whose coefficients r(i-k) were used in
computing the vector of predictor coefficients x in (3.16). This
distortion resembles the quadratic distortion measure,
but here the weighting matrix R_x depends on the input vector x,
whereas the weighting matrix of the quadratic distortion measure does not.
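As a small worked illustration of (3.16) and (3.17), the sketch below builds the normalized Toeplitz autocorrelation matrix from a list of autocorrelation terms and evaluates the quadratic form. The function names and the toy numbers are assumptions made for the example, not values from this study.

```python
def normalized_autocorr_matrix(r, n):
    """Build the n x n matrix R_x = [r(|i-k|) / r(0)] of (3.17).

    r must hold at least n autocorrelation terms r(0) .. r(n-1).
    """
    return [[r[abs(i - k)] / r[0] for k in range(n)] for i in range(n)]


def modified_is_distortion(x, y, R):
    """Evaluate d(x, y) = (x - y)' R (x - y) as in (3.16)."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    # R_diff = R * (x - y)
    R_diff = [sum(R[i][k] * diff[k] for k in range(n)) for i in range(n)]
    return sum(d * v for d, v in zip(diff, R_diff))


R = normalized_autocorr_matrix([2.0, 1.0, 0.5], 2)   # [[1.0, 0.5], [0.5, 1.0]]
d = modified_is_distortion([1.0, 1.0], [0.0, 0.0], R)
```

Because R depends on the input vector x, d(x, y) and d(y, x) generally differ, so the measure is not a metric in the strict sense.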

Comparing equations (3.16) and (3.10), we find that the
modified Itakura-Saito distortion measure has the same form as the
total squared prediction error in LPC with the autocorrelation
method. This means that the Itakura-Saito distortion measure is
consistent with the residual energy minimization inherent in LPC
analysis, which is a very desirable property for VQ with LPC parameters.
We extended equation (3.16) to LPC vectors obtained by pitch synchronous
analysis with the covariance method as follows. To satisfy the same
condition for LPC vectors obtained from the covariance method, the
modified Itakura-Saito distortion should have the same form as
(3.10). Thus equation (3.16) becomes

$$ d(x,y) = (x-y)'\, C_x\, (x-y) \tag{3.18} $$

where C_x represents the covariance matrix of the input signal that
is used to obtain the input LPC vector x. Since the matrix C_x is not
Toeplitz, normalizing C_x by c(0) is not expected to give the
correct normalization across different speakers. Thus, we normalized
the input speech signal directly, removing the mean and dividing
by the root mean square value on a pitch period basis, before obtaining
the LPC coefficients.
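The per-period normalization described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the root mean square value is taken after the mean has been removed, which the text does not state explicitly.

```python
import math


def normalize_pitch_period(samples):
    """Remove the mean of one pitch period and scale it to unit RMS."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    rms = math.sqrt(sum(c * c for c in centered) / n)
    if rms == 0.0:
        # A constant segment has no energy left after mean removal.
        return centered
    return [c / rms for c in centered]


period = normalize_pitch_period([1.0, 3.0])   # mean 2, RMS 1
```

After this step every pitch period has zero mean and unit energy per sample, so differences in recording level between speakers no longer scale the covariance matrix C_x.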

Codebook  Design 

In order to design an L-level codebook, as mentioned earlier,
we have to partition the N-dimensional space into L cells {C_i,
1 ≤ i ≤ L} using the given training vectors. We associate each cell C_i with
a vector y_i. The quantizer then assigns the code vector y_i if the
input vector x is in C_i. Each code vector y_i is chosen to minimize
the average distortion in cell C_i. We call such a vector the
centroid of the cell C_i. Computing the centroid for a particular
region depends upon the definition of the distortion measure.


With the distortion measure of (3.16), the centroid is given by
[Linde et al., 1980]

$$ y_j = \Big[ \sum_i R_{x_i} \Big]^{-1} \sum_i R_{x_i}\, x_i' , \quad \text{for all } i : x_i \in C_j \tag{3.19} $$

where x' represents the column vector made up of a set of LPC
coefficients.
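A centroid of this weighted form can be computed by accumulating the weighting matrices and the weighted vectors of a cell and then solving one linear system. The sketch below assumes the centroid is the minimizer of the summed quadratic distortion (3.16) over the cell; the helper names are hypothetical, and a small Gaussian elimination routine is included only to keep the example self-contained.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z


def weighted_centroid(vectors, matrices):
    """Return y solving [sum_i R_i] y = sum_i R_i x_i for one cell."""
    n = len(vectors[0])
    S = [[sum(R[i][k] for R in matrices) for k in range(n)] for i in range(n)]
    t = [sum(sum(R[i][k] * x[k] for k in range(n))
             for R, x in zip(matrices, vectors)) for i in range(n)]
    return solve_linear(S, t)


identity = [[1.0, 0.0], [0.0, 1.0]]
y = weighted_centroid([[1.0, 2.0], [3.0, 4.0]], [identity, identity])
```

When every weighting matrix is the identity, the result reduces to the ordinary arithmetic mean of the cell, which is a convenient sanity check.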

To  use  an  iterative  clustering  algorithm  to  design  the 
codebook,  we  need  initial  code  vectors.  There  are  several  ways  to 
choose  these.  One  method  for  use  on  sample  distributions  is  that  of 
the  K-means  method,  namely,  choosing  the  first  L vectors  in  the 
training  sequences  as  the  initial  code  vectors.  Another  method  is 
to  choose  the  initial  code  vector  by  computing  the  centroid  of  all 
the  given  training  vectors.  Then  the  splitting  technique  is  used 
until  we  get  the  desired  level  of  codebook.  The  splitting  technique 
for  codebook  design  is  an  ad  hoc  approach  not  guaranteed  to  be 
better  than  other  methods,  but  it  has  been  reported  to  give 
reasonably  satisfactory  results  [Buzo  et  al.,  1980].  We  used  this 
method  for  the  selection  of  the  initial  code  vector. 

For  the  given  training  LPC  vector  sequences,  the  codebook  is 
generated  as  follows. 

Step 1. Initialization.

Set n=0 and D_avg=10000 (arbitrarily chosen as a large
positive number). Fix L, the largest number of levels desired,
as an integer power of two.

Step 2. Find the centroid y_0 of all the training vectors using
equation (3.19).

This y_0 is used as the initial code vector.

Step 3. Set n=n+1.

Step 4. Split all the centroids by perturbing them in some manner.
Then, partition the complete set of training vectors
iteratively, until the decrement of D_avg at each iteration
is less than a predetermined threshold, in such a way that
the average distortion

$$ D_{avg} = \frac{1}{M} \sum_{j=1}^{2^n} \sum_{x_i \in C_j} d(x_i, y_j) \tag{3.20} $$

is minimized over the whole training set, i.e., the M training
vectors. Here d(x_i, y_j) denotes the distortion between the vectors
x_i and y_j, and n represents the splitting level.

The vector y_j represents the centroid of each partition C_j,
1 ≤ j ≤ L, and it becomes the codeword.

Step 5. If the desired level of the codebook is met, i.e., 2^n ≥ L,
then stop; otherwise go back to step 3.

At step 4, the threshold to terminate the iteration in each level of
codebook is defined as

$$ \text{THRESHOLD} = \frac{\text{Previous } D_{avg} - \text{Current } D_{avg}}{\text{Current } D_{avg}} \tag{3.22} $$


The average distortion obtained by (3.20) will be examined for the
classification of normal and pathological subjects, and the codebook
generated in step 4 becomes the speaker-based codebook that will be
used for speaker identification.
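The splitting procedure of steps 1 to 5 can be sketched in a few lines. To keep the example short, a squared-error distortion with an arithmetic-mean centroid is swapped in for the measure of (3.16) and the centroid of (3.19); the additive perturbation `eps` used for splitting and all function names are likewise assumptions of this sketch.

```python
def sq_dist(x, y):
    """Stand-in distortion (assumption): squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_vector(vectors):
    """Centroid under the squared-error distortion."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def partition(train, codebook):
    """Assign each training vector to its nearest codeword; return cells, D_avg."""
    cells = [[] for _ in codebook]
    total = 0.0
    for x in train:
        dists = [sq_dist(x, y) for y in codebook]
        j = min(range(len(codebook)), key=dists.__getitem__)
        cells[j].append(x)
        total += dists[j]
    return cells, total / len(train)


def design_codebook(train, max_levels, threshold=1e-4, eps=1e-3):
    codebook = [mean_vector(train)]                      # step 2
    _, d_avg = partition(train, codebook)
    while len(codebook) < max_levels:                    # step 5
        codebook = [[c + s * eps for c in y]             # step 4: split
                    for y in codebook for s in (1.0, -1.0)]
        prev = float("inf")                              # "large positive number"
        while True:
            cells, d_avg = partition(train, codebook)
            # Relative decrement test of (3.22).
            if d_avg == 0.0 or (prev - d_avg) / d_avg < threshold:
                break
            prev = d_avg
            codebook = [mean_vector(c) if c else y       # keep empty cells as-is
                        for c, y in zip(cells, codebook)]
    return codebook, d_avg


train = [[0.0], [0.2], [10.0], [10.2]]
codebook, d_avg = design_codebook(train, max_levels=2)
```

The returned `d_avg` is the quantity examined for normal/pathological classification, and the returned codebook plays the role of the speaker-based codebook.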


CHAPTER  FOUR 


DATA  COLLECTION  AND  PREPROCESSING 
Data  Collection 

Description  of  the  Computer  System 

The data collection system used for this research is shown in
Figure 4-1. The major building blocks of this system are an
Electro-Voice RE-10 microphone to convert the acoustic pressure
into an electrical signal, a Fourcin type electroglottograph to
obtain the EGG signal, two Digital Sound Corporation DSC-240
preamplifiers to record and play back both signals, a DSC-200
digitizer that multiplexes the speech and EGG signals synchronously
at a sampling rate of 20 kHz with 16 bits/sample, a VAX 11/750
computer system to manage all of the processing, an
oscilloscope to monitor both the speech and EGG signals during data
collection, and a loudspeaker to monitor the digitized speech signal.

Data  Base 

Two  groups  of  subjects  were  used  in  this  study,  subjects  with  a 
normal  larynx  and  subjects  with  a pathology  of  the  vocal  folds.  The 
total  number  of  subjects  we  collected  data  from  is  121,  including  92 
normal subjects and 29 subjects with a pathology. The subjects'
ages ranged from 20 to 80.

Figure 4-1. Block diagram of the data collection system

Pathological subjects' data base. The data base for the
pathological  subjects  consisted  of  voices  judged  as  deviant  by  the 
experts  in  the  speech  department.  Patients  from  the  University  of 
Florida  ENT  (Ear,  Nose,  and  Throat)  clinic  and  from  the  speech  and 
hearing  clinic  were  asked  to  voluntarily  serve  as  subjects  in  our 
research program. Each subject was asked to utter the sustained
vowel /i/, some words, and several standard passages at a normal
voice level in an IAC (Industrial Acoustics Company) sound booth.
Both the speech and EGG signals of each subject were then digitized
synchronously by the DSC-200 digitizer and the VAX 11/750 computer
system and stored on disk for further processing. The
pathological group consisted of 23 voices (8 males, 15 females) of
patients with vocal disorders and 6 voices that mimicked several
vocal disorders. Table 4-1 lists the pathological subjects
along with a description of their pathologies. The voices
ranged from mildly deviant to very deviant.

Normal subjects' data base. The normal subjects' data base
consisted of two different groups. One is a population of 52
normal subjects (27 males, 25 females). Together with the pathological
group, this normal group was used to help extract objective features
that would discriminate between normal and pathological subjects. The other
is a population of 40 normal subjects (20 males, 20 females), which was
used for the speaker identification task. The data base for this group
is described in detail in Chapter 6.
Table 4-1. List of pathological subjects with description of their
pathology

Subject (Sex)   Description

DMH3  (M)       mimicked vocal fry voice
DMH10 (M)       mimicked breathy voice
DMH11 (M)       mimicked hoarse voice
DMH31 (M)       mimicked rough voice
GPM3  (M)       mimicked breathy voice
GPM10 (M)       mimicked rough voice
ABC   (F)       post nodule voice disorder, mild hoarseness
                occasionally
AHZ   (F)       hyperfunctional voice disorder
CWG   (F)       left vocal fold paralysis, cancer of thyroid,
                thyroid removed
C2G   (F)       1 month post of subject CWG
C3G   (F)       3 month post collagen injection of subject CWG
CLW   (F)       enlarged vocalis muscle
EDR   (M)       bilateral vocal folds paralysis, collagen injection,
                breathiness is major symptom
E2R   (M)       1 month post of subject EDR
JJS   (M)       cancer on the left vocal fold, severe hoarseness
JCY   (F)       posterior cyst, severe hoarseness
JFH   (M)       contact ulcer on the left vocal fold
JMS   (M)       hoarse, incomplete glottal closure
J2S   (M)       second visit of subject JMS
JTO   (M)       normal larynx, consistent vocal fry
MLM   (F)       nodules
MXN   (F)       right vocal fold paralysis with multiple Teflon
                injection
NUK   (F)       unilateral paralysis, collagen injection
N2K   (F)       1 month post of subject NUK
N3K   (F)       third visit of subject NUK
PJB   (F)       preinjection, weak and breathy
P2B   (F)       1 month post injection of subject PJB
STR   (M)       cancer of thyroid, tissue removed, bilateral
                vocal folds paralysis
TLW   (F)       bilateral nodules

Demultiplexing and Trimming the Data

The collected data are 2-channel (speech and EGG) multiplexed
signals sampled at 20 kHz, and contain a large portion of silence at
the beginning and end of each file. We demultiplexed the signal
to obtain the speech and EGG signals, each sampled at 10 kHz,
and trimmed the unnecessary silence from the beginning and end of
each utterance to save computer memory space.
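The two preprocessing operations can be illustrated with the following sketch. It assumes that the multiplexed file alternates speech and EGG samples with the speech sample first, and that silence is detected with a simple amplitude threshold on the speech channel; neither detail is stated in the text, and the function names are hypothetical.

```python
def demultiplex(interleaved):
    """Split an interleaved 2-channel sequence into (speech, egg)."""
    return interleaved[0::2], interleaved[1::2]


def trim_silence(speech, egg, threshold):
    """Drop leading and trailing samples whose speech amplitude is small.

    The same sample range is kept in both channels so that the speech
    and EGG signals stay synchronous.
    """
    active = [i for i, s in enumerate(speech) if abs(s) > threshold]
    if not active:
        return [], []
    start, end = active[0], active[-1] + 1
    return speech[start:end], egg[start:end]


raw = [0, 100, 0, 101, 5, 102, -7, 103, 0, 104]
speech, egg = demultiplex(raw)
speech_trimmed, egg_trimmed = trim_silence(speech, egg, threshold=1)
```

Taking every second sample halves the per-channel rate, which is why a 20 kHz multiplexed stream yields two 10 kHz signals.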

In  summary,  the  data  base  we  used  has  the  following  properties: 

. 29  pathological  subjects  (6  mimicked,  23  real  patients) 

. 52  normal  subjects  (27  males,  25  females) 

. 40  normal  subjects  (20  males,  20  females)  for  speaker 
identification  task 
. speakers'  ages  ranged  from  20  to  80 

. recordings  were  made  in  an  IAC  sound  booth  with  Electro-Voice 
RE-10  microphone  (dynamic  cardioid)  held  approximately  6 
inches  from  the  lips 

. synchronous  2-channel  data  (Speech  and  EGG) 

. speech task was the sustained vowel /i/

Examples  of  the  LPC  Spectrum 
Selection  of  the  LPC  Order 

In  Chapter  3,  we  discussed  the  determination  of  prediction 
order  in  the  speech  signal  analysis.  The  discussion  suggested  that, 
when the sampling frequency is 10 kHz, an order of 12 is adequate
for  voiced  speech  with  no  preemphasis  [Markel,  1972;  Smith,  1980]. 
For  the  speech  signal  with  preemphasis,  however,  the  order  can  be 
reduced  since  the  glottal  waveform  and  lip  radiation  characteristics 
are  approximately  eliminated  in  the  analysis  [Markel  and  Gray, 
1976].  Therefore,  we  selected  the  order  of  the  LPC  as  10  for  the 
preemphasized  speech  signal  using  the  Hamming  window. 
Normal vs. Pathological Subjects

Examples of the LPC spectral envelopes for the sustained
vowel /i/, produced by two typical normal subjects, DMH and BLC,
are shown in Figure 4-2. The 10th order LPC coefficients for
50 consecutive frames were obtained from the pitch asynchronous
analysis using the autocorrelation method. The analysis frame size
was 256 samples and the frame was advanced by 64 samples at each step.
The speech signal was preemphasized and the Hamming window was used.
The  LPC  spectral  envelopes  for  the  mimicked  pathological  voices, 
produced  by  DMH,  are  shown  in  Figure  4-3.  The  same  kinds  of 
spectral  envelopes  for  the  real  patients,  STR  and  CWG,  are  shown  in 
Figure  4-4.  From  Figures  4-2  to  4-4,  it  is  clear  that  the 
pathological  voices  show  more  variations  in  spectral  envelopes  than 
the  normal  voices.  However,  Figure  4-3  illustrates  that  the  breathy 
type  voice  can  have  somewhat  stable  spectral  envelopes  (showing  less 
variations)  compared  to  the  hoarse  type  voice. 
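The pitch asynchronous analysis described above (preemphasis, 256-sample Hamming-windowed frames advanced by 64 samples, 10th order autocorrelation method) can be sketched as follows. The preemphasis coefficient of 0.95 is an assumed value that the text does not give, the function names are hypothetical, and the predictor convention is x̂(n) = Σ a_k x(n-k).

```python
import math


def preemphasize(x, alpha=0.95):
    """First-order preemphasis; alpha is an assumed coefficient."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def hamming(n):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorr(x, order):
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(order + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion; returns (predictor coefficients, error)."""
    a, err = [], r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))
        k = acc / err
        a = [a[j] - k * a[m - 2 - j] for j in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a, err


def lpc_frames(signal, order=10, frame_size=256, shift=64):
    """Return one LPC vector per analysis frame."""
    x = preemphasize(signal)
    w = hamming(frame_size)
    vectors = []
    for start in range(0, len(x) - frame_size + 1, shift):
        seg = [x[start + i] * w[i] for i in range(frame_size)]
        r = autocorr(seg, order)
        if r[0] == 0.0:          # skip silent frames
            continue
        vectors.append(levinson(r, order)[0])
    return vectors
```

Each returned vector corresponds to one spectral envelope of the kind superimposed in the figures, so the spread among the vectors of one subject reflects the spectral variation discussed in the text.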

Some  examples  of  the  pitch  synchronous  LPC  spectra  for  the 
normal  and  pathological  subjects  are  shown  in  Figure  4-5.  The 
speech  signal  was  preemphasized  and  Hamming  windowed  on  the  pitch 
period  basis.  The  frame  size  of  a single  pitch  period  was  chosen 
between  two  consecutive  differentiated  EGG  minima.  In  the  pitch 
synchronous  LPC  analysis  (covariance  method)  for  the  frame  size  of 
one  pitch  period,  using  a Hamming  window  resulted  in  more  reliable 
spectral  estimation.  The  experimental  results  for  the  synthetic 
speech  signals  are  given  in  Appendix  A. 
Figure 4-2. LPC spectral envelopes of the sustained vowel /i/ for
normal subjects
(10th order pitch asynchronous analysis, Hamming window)

(a) Subject : DMH (male voice)

(b) Subject : BLC

Figure 4-3. LPC spectral envelopes of the sustained vowel /i/ for
mimicked pathological subjects
(10th order pitch asynchronous analysis, Hamming window)

(a) Subject : DMH10

(b) Subject : DMH11

Figure 4-4. LPC spectral envelopes of the sustained vowel /i/ for real
pathological subjects
(10th order pitch asynchronous analysis, Hamming window;
vertical axis: amplitude in dB)

(a) Male subject : STR

(b) Female subject : CWG


Figure 4-5. LPC spectra of the sustained vowel /i/ for three male
subjects (10th order pitch synchronous analysis, Hamming
window). The spectra for 4 LPC vectors are superimposed
for each subject.

(a) Subject : TDC (normal male voice)

(b) Subject : MCG (normal female voice)

(c) Subject : DMH10 (mimicked breathy voice)

(d) Subject : MBK (normal female voice)
Examples of VQ with LPC Vectors
Selection  of  Codebook  Size 

In this research, the VQ technique was used to compute the
average Itakura-Saito distortion for the detection of voice
disorders as well as to obtain the speaker-based VQ codebook for
speaker identification. The average distortion for each subject can
change considerably depending upon the size of the codebook. The
effects of codebook size on speaker identification are described in
detail in Chapter 6.

For  the  detection  of  voice  disorders  the  size  of  the  codebook 
should  be  chosen  to  give  the  maximum  separation  of  the  distortion 
values  between  the  normal  and  pathological  groups.  From  100  LPC 
vectors  for  each  subject,  obtained  from  the  pitch  asynchronous 
analysis  using  the  autocorrelation  method,  we  computed  the  average 
spectral distortion for various codebook sizes. Examples of the
VQ analysis results for a typical normal subject and a typical pathological
subject are shown in Tables 4-2 and 4-3. We can see that the
average  distortion  of  the  pathological  subject  is  greater  than  that 
of  the  normal  subject  for  the  same  size  of  the  codebook. 

To  determine  the  VQ  codebook  size,  we  investigated  the 
distortion  values  for  the  codebook  size  from  one  to  sixteen  for  30 
subjects.  These  subjects  consisted  of  6 mimicked  pathological 
subjects,  9 patients  having  voice  disorders,  and  15  normal  subjects. 
Table  4-4  shows  the  list  of  these  subjects  along  with  their  VQ 
analysis  results.  Using  the  distortion  for  each  codebook  size, 
given  in  Table  4-4,  we  did  the  normal/pathological  classification 
Table 4-2. Example of VQ analysis for normal subject
(Pitch asynchronous LPC analysis)

L =  1   # OF ITERATION =  1   DISTORTION = 0.0017847
L =  2   # OF ITERATION =  1   DISTORTION = 0.0017838
L =  2   # OF ITERATION =  2   DISTORTION = 0.0014426
L =  2   # OF ITERATION =  3   DISTORTION = 0.0008894
L =  2   # OF ITERATION =  4   DISTORTION = 0.0008843
L =  2   # OF ITERATION =  5   DISTORTION = 0.0008833
L =  2   # OF ITERATION =  6   DISTORTION = 0.0008833
L =  4   # OF ITERATION =  1   DISTORTION = 0.0008825
L =  4   # OF ITERATION =  2   DISTORTION = 0.0007107
L =  4   # OF ITERATION =  3   DISTORTION = 0.0005418
L =  4   # OF ITERATION =  4   DISTORTION = 0.0005250
L =  4   # OF ITERATION =  5   DISTORTION = 0.0005240
L =  4   # OF ITERATION =  6   DISTORTION = 0.0005233
L =  4   # OF ITERATION =  7   DISTORTION = 0.0005233
L =  8   # OF ITERATION =  1   DISTORTION = 0.0005226
L =  8   # OF ITERATION =  2   DISTORTION = 0.0004746
L =  8   # OF ITERATION =  3   DISTORTION = 0.0003708
L =  8   # OF ITERATION =  4   DISTORTION = 0.0003354
L =  8   # OF ITERATION =  5   DISTORTION = 0.0003255
L =  8   # OF ITERATION =  6   DISTORTION = 0.0003255
L = 16   # OF ITERATION =  1   DISTORTION = 0.0003248
L = 16   # OF ITERATION =  2   DISTORTION = 0.0002730
L = 16   # OF ITERATION =  3   DISTORTION = 0.0002445
L = 16   # OF ITERATION =  4   DISTORTION = 0.0002272
L = 16   # OF ITERATION =  5   DISTORTION = 0.0002181
L = 16   # OF ITERATION =  6   DISTORTION = 0.0002162
L = 16   # OF ITERATION =  7   DISTORTION = 0.0002114
L = 16   # OF ITERATION =  8   DISTORTION = 0.0002075
L = 16   # OF ITERATION =  9   DISTORTION = 0.0002065
L = 16   # OF ITERATION = 10   DISTORTION = 0.0002065

FILE NAME OF INPUT VECTOR : DMHN003.LA
FILE NAME OF GENERATED CODEBOOK : DMHN003.CB
NUMBER OF TRAINING SEQUENCES : 100
THE MAX. CODEBOOK SIZE (L) : 16
THRESHOLD TO TERMINATE ITERATION : 0.0001

Table 4-3. Example of VQ analysis for pathological subject
(Pitch asynchronous LPC analysis)

L =  1   # OF ITERATION =  1   DISTORTION = 0.0259500
L =  2   # OF ITERATION =  1   DISTORTION = 0.0259321
L =  2   # OF ITERATION =  2   DISTORTION = 0.0188476
L =  2   # OF ITERATION =  3   DISTORTION = 0.0183196
L =  2   # OF ITERATION =  4   DISTORTION = 0.0183196
L =  4   # OF ITERATION =  1   DISTORTION = 0.0183052
L =  4   # OF ITERATION =  2   DISTORTION = 0.0130153
L =  4   # OF ITERATION =  3   DISTORTION = 0.0125840
L =  4   # OF ITERATION =  4   DISTORTION = 0.0125455
L =  4   # OF ITERATION =  5   DISTORTION = 0.0124667
L =  4   # OF ITERATION =  6   DISTORTION = 0.0123961
L =  4   # OF ITERATION =  7   DISTORTION = 0.0123343
L =  4   # OF ITERATION =  8   DISTORTION = 0.0122227
L =  4   # OF ITERATION =  9   DISTORTION = 0.0121603
L =  4   # OF ITERATION = 10   DISTORTION = 0.0121497
L =  4   # OF ITERATION = 11   DISTORTION = 0.0121321
L =  4   # OF ITERATION = 12   DISTORTION = 0.0121321
L =  8   # OF ITERATION =  1   DISTORTION = 0.0121200
L =  8   # OF ITERATION =  2   DISTORTION = 0.0098546
L =  8   # OF ITERATION =  3   DISTORTION = 0.0090500
L =  8   # OF ITERATION =  4   DISTORTION = 0.0089586
L =  8   # OF ITERATION =  5   DISTORTION = 0.0088768
L =  8   # OF ITERATION =  6   DISTORTION = 0.0087864
L =  8   # OF ITERATION =  7   DISTORTION = 0.0087370
L =  8   # OF ITERATION =  8   DISTORTION = 0.0087248
L =  8   # OF ITERATION =  9   DISTORTION = 0.0087248

FILE NAME OF INPUT VECTOR : MXNP003.LA
FILE NAME OF GENERATED CODEBOOK : MXNP003.CB
NUMBER OF TRAINING SEQUENCES : 100
THE MAX. CODEBOOK SIZE (L) : 8
THRESHOLD TO TERMINATE ITERATION : 0.0001

Table 4-4. List of 30 subjects along with VQ analysis result

No.  Name   Sex   SD1       SD2       SD4       SD8       SD16

Pathological subjects:

 1.  ABC    F     1.36113    .80627    .58565    .35518    .23194
 2.  DMH3   M     1.54823    .66375    .37496    .22420    .14903
 3.  DMH10  M      .92323    .73273    .57575    .42873    .32819
 4.  DMH11  M     1.81956   1.43444   1.13054    .85552    .62164
 5.  DMH31  M      .97512    .79702    .64421    .48629    .35718
 6.  GPM3   M     1.40312   1.09747    .90111    .73441    .51067
 7.  GPM10  M     1.07540    .89512    .71119    .50942    .38017
 8.  JCY    F      .66739    .48848    .36610    .27359    .18345
 9.  JFH    M     1.50677    .82332    .56737    .37384    .2039
10.  JMS    M     1.36417   1.02301    .73746    .49071    .30444
11.  JTO    M      .61962    .17118    .07262    .03425    .01911
12.  NUK    F     2.81766   1.57664   1.23997    .97647    .71236
13.  PJB    F     3.13400   2.25665   1.50758   1.09619   1.37586
14.  TLW    F     1.03888    .83958    .67075    .52169    .35547
15.  STR    M      .98186    .75723    .52721    .39886    .44940

Normal subjects:

16.  AAA    M     1.73743   1.05913    .59354    .40880    .26408
17.  BEM    F      .58660    .46755    .35288    .26897    .18247
18.  CXO    F      .77016    .46271    .32892    .22915    .15102
19.  DDH    F     1.08931    .66823    .40775    .24643    .13079
20.  DGC    M     1.20014    .60906    .43135    .26741    .17396
21.  DMH    M      .17847    .08833    .05233    .03255    .02065
22.  DMP    F      .55876    .38426    .25390    .16176    .10197
23.  DRW    M     1.03323    .35802    .23296    .15313    .09724
24.  DVD    F      .28521    .23371    .10716    .06957    .04856
25.  GPM    M     1.15431    .75655    .50433    .34475    .23196
26.  HBR    M     1.25865    .76943    .34776    .20240    .11161
27.  JRS    M      .22834    .15976    .08447    .04608    .02883
28.  KCM    F      .49430    .36447    .27007    .18781    .11339
29.  LAD    F      .78999    .47529    .19342    .13813    .07990
30.  SEN    M     1.49109    .49185    .19768    .11591    .07360

* SDi represents an average LPC distortion for the codebook size "i"

tests. Figure 4-6 shows plots of the classification error versus
the codebook size for both the closed threshold test and the discriminant
analysis with a "leave-one-out" method. It can be seen that
increasing the codebook size beyond four does not reduce
the total number of classification errors. Considering that a
reasonable number of training vectors for each centroid is at least
10 or 20 [Buzo et al., 1980], as well as the analysis result shown in
Figure 4-6, we selected a codebook size of four.
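A threshold test of this kind can be sketched as follows. The sketch assumes that "closed" means the threshold is selected on the same subjects on which it is then scored, and that a subject is called pathological when its average distortion exceeds the threshold; the function name and the toy values are hypothetical and are not taken from Table 4-4.

```python
def closed_threshold_test(normal, pathological):
    """Pick the threshold with the fewest total errors on the given data.

    Returns (threshold, misses, false_alarms), or None when fewer than
    two distinct distortion values are available.
    """
    values = sorted(set(normal) | set(pathological))
    # Candidate thresholds: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        misses = sum(1 for v in pathological if v <= t)
        false_alarms = sum(1 for v in normal if v > t)
        if best is None or misses + false_alarms < best[1] + best[2]:
            best = (t, misses, false_alarms)
    return best


threshold, misses, false_alarms = closed_threshold_test(
    [0.1, 0.2, 0.3], [0.25, 0.5, 0.6])
```

Running the same function on one column of distortion values at a time (for example, one codebook size) gives the kind of error count that is plotted against codebook size.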

Normal  vs.  Pathological 

The  superimposed  LPC  spectra  for  100  LPC  vectors  used  for 
training  the  VQ  codebook  in  Table  4-2,  and  the  LPC  spectra  of  four 
centroids  are  shown  in  Figure  4-7.  The  minimum  average  distortion 
between  the  spectra  of  the  codebook  and  the  spectra  of  the  input 
training  vectors  is  computed  using  the  modified  form  of  Itakura- 
Saito  distortion  measure,  defined  in  (3.20)  and  (3.16).  The 
distortions  for  the  codebook  sizes  of  one  and  four  for  30  subjects 
we  analyzed  are  shown  in  Figure  4-8.  We  can  see  that  the  average 
distortion  for  the  codebook  size  of  four  shows  good  separation 
between  the  normal  and  pathological  groups. 

The  examples  of  the  VQ  analysis  results  with  the  pitch 
synchronous  analyzed  LPC  vectors  for  two  typical  normal  subjects  are 
shown  in  Tables  4-5  and  4-6.  The  LPC  spectra  for  their  codebooks 
are  shown  in  Figure  4-9. 
Figure 4-6. Classification errors versus the codebook size (N=30)

(a) Closed threshold test

(b) Discriminant analysis using the "leave-one-out" method

Plotted curves: number of misclassified pathological subjects,
number of misclassified normal subjects, and number of total errors.

Figure 4-7. LPC spectra of the sustained vowel /i/ for subject DMH
(10th order pitch asynchronous analysis, Hamming window)

(a) Spectra for 100 input (10th order) LPC vectors

(b) LPC spectra of the 4 centroids obtained from VQ analysis


Figure 4-8. Example of LPC distortion analysis for normal and
pathological subjects (Pitch asynchronous analysis).
Vertical axis: distortion x 0.01. Separate plotting symbols mark the
LPC spectral distortion with one centroid and with four centroids.
Subject numbers 1-15: pathological subjects; 16-30: normal subjects.

Figure 4-9. LPC spectra of the VQ codebook for two normal subjects
(10th order pitch synchronous analysis for vowel /i/,
Hamming window)

(a) Male subject : AWB

(b) Female subject : VKR

Table 4-5. Example of VQ analysis for normal male subject
(Pitch synchronous LPC analysis)

L =  1   # OF ITERATION =  1   DISTORTION = 0.0003712
L =  2   # OF ITERATION =  1   DISTORTION = 0.0003710
L =  2   # OF ITERATION =  2   DISTORTION = 0.0003196
L =  2   # OF ITERATION =  3   DISTORTION = 0.0002822
L =  2   # OF ITERATION =  4   DISTORTION = 0.0002704
L =  2   # OF ITERATION =  5   DISTORTION = 0.0002648
L =  2   # OF ITERATION =  6   DISTORTION = 0.0002577
L =  2   # OF ITERATION =  7   DISTORTION = 0.0002504
L =  2   # OF ITERATION =  8   DISTORTION = 0.0002478
L =  2   # OF ITERATION =  9   DISTORTION = 0.0002468
L =  2   # OF ITERATION = 10   DISTORTION = 0.0002458
L =  2   # OF ITERATION = 11   DISTORTION = 0.0002448
L =  2   # OF ITERATION = 12   DISTORTION = 0.0002437
L =  2   # OF ITERATION = 13   DISTORTION = 0.0002436
L =  2   # OF ITERATION = 14   DISTORTION = 0.0002436
L =  4   # OF ITERATION =  1   DISTORTION = 0.0002433
L =  4   # OF ITERATION =  2   DISTORTION = 0.0001948
L =  4   # OF ITERATION =  3   DISTORTION = 0.0001722
L =  4   # OF ITERATION =  4   DISTORTION = 0.0001651
L =  4   # OF ITERATION =  5   DISTORTION = 0.0001643
L =  4   # OF ITERATION =  6   DISTORTION = 0.0001640
L =  4   # OF ITERATION =  7   DISTORTION = 0.0001640

FILE NAME OF INPUT VECTOR : TDCN003.CLA
FILE NAME OF GENERATED CODEBOOK : TDCN003.CCB
NUMBER OF TRAINING SEQUENCES : 100
THE MAX. CODEBOOK SIZE (L) : 4
THRESHOLD TO TERMINATE ITERATION : 0.0001

Table 4-6. Example of VQ analysis for normal female subject
(Pitch synchronous LPC analysis)

L =  1   # OF ITERATION =  1   DISTORTION = 0.0002313
L =  2   # OF ITERATION =  1   DISTORTION = 0.0002312
L =  2   # OF ITERATION =  2   DISTORTION = 0.0001854
L =  2   # OF ITERATION =  3   DISTORTION = 0.0001495
L =  2   # OF ITERATION =  4   DISTORTION = 0.0001473
L =  2   # OF ITERATION =  5   DISTORTION = 0.0001467
L =  2   # OF ITERATION =  6   DISTORTION = 0.0001465
L =  2   # OF ITERATION =  7   DISTORTION = 0.0001465
L =  2   # OF ITERATION =  8   DISTORTION = 0.0001465
L =  4   # OF ITERATION =  1   DISTORTION = 0.0001464
L =  4   # OF ITERATION =  2   DISTORTION = 0.0001173
L =  4   # OF ITERATION =  3   DISTORTION = 0.0001097
L =  4   # OF ITERATION =  4   DISTORTION = 0.0001087
L =  4   # OF ITERATION =  5   DISTORTION = 0.0001076
L =  4   # OF ITERATION =  6   DISTORTION = 0.0001063
L =  4   # OF ITERATION =  7   DISTORTION = 0.0001060
L =  4   # OF ITERATION =  8   DISTORTION = 0.0001057
L =  4   # OF

CHAPTER FIVE


ASSESSMENT  OF  LARYNGEAL  FUNCTION  FROM  SPEECH  AND  EGG  SIGNALS 

Introduction 

One  important  aspect  of  this  research  was  to  develop 
quantitative  measures  for  the  evaluation  of  laryngeal  function.  As 
we  mentioned  earlier  in  Chapter  1,  we  used  two  approaches  for  this 
analysis.  One  is  analyzing  the  LPC  spectral  variations  of  the 
speech  signal  using  the  VQ  technique.  The  other  is  analyzing  the 
perturbations  of  various  segments  and  events  of  the  EGG  signal  at 
various  positions  within  a cycle. 

In Chapter 4, we showed that the average Itakura-Saito distortion
measure, when used with VQ, could serve as an indicator for the
detection of a laryngeal pathology. In the EGG signal analysis, a
set of time interval and amplitude difference measurements was
calculated from each subject's EGG waveform to serve as the basis
for comparing results across subjects (normal vs. pathological
voices). We also calculated the variations in the pitch period and
in the amplitude of the EGG waveform. These measurements were also
compared across subjects.

The data base we used for the detection of voice disorders
consisted of 29 pathological subjects and 52 normal subjects. To
examine the effectiveness of the selected features for the
classification of the normal and pathological groups, we
investigated the histograms of each feature in each group, together
with the mean and standard deviation of each feature and the
correlation matrices for each group. We also performed a
classification test for the normal and pathological subjects.

This  chapter  is  organized  in  the  following  manner.  We  first 
describe  the  experimental  procedures  for  the  measurement  and 
analysis  of  the  various  parameters  we  defined.  Then  we  present  the 
measurement  and  analysis  results  made  from  our  data  base.  Finally 
we  discuss  our  findings. 

Experimental  Procedure 
Acoustic  Signal  Analysis 

Pitch asynchronous LPC analysis by the autocorrelation method.
Figure 5-1 shows the experimental procedure for analyzing the speech
signal to compute the average spectral distortion. First we
selected stable segments of the speech and EGG signals from the data
file of each subject. Then the LPC analysis using the
autocorrelation method was performed with the following conditions:

Filter order                           : 10 coefficients
Analysis frame                         : 256 samples/frame
Frame overlap                          : 192 samples (3/4 frame overlap)
Analysis window                        : Hamming window
Speech preemphasis factor              : 0.9
Data set for coefficient calculations  : 100 consecutive frames

From the LPC coefficients and autocorrelation functions for each
subject, the average Itakura-Saito spectral distortion, defined in
(3.20) and (3.16), was computed using the VQ algorithm, described in
Chapter 3, with the following conditions:

Size of the codebook (number of centroids)  : 4
Threshold to terminate iteration            : 0.0001
Number of training vectors                  : 100

Figure 5-1. Experimental procedure for LPC distortion analysis
using vector quantization
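
The framing conditions listed above can be sketched in Python with
NumPy. This is only an illustration under stated assumptions, not the
program used in this work: the function names are hypothetical,
preemphasis is assumed to be the first-order difference
y[n] = x[n] - 0.9 x[n-1], the inverse filter is assumed to have the
form A(z) = 1 + a1 z^-1 + ... + a10 z^-10, and a frame overlap of 192
samples is read as a frame advance of 64 samples.

```python
import numpy as np


def lpc_autocorrelation(frame, order=10):
    """Autocorrelation-method LPC by the Levinson-Durbin recursion.

    Returns (a, r, err): the inverse-filter coefficients a[0..order]
    with a[0] = 1, the autocorrelation r[0..order], and the residual
    energy of the final predictor.
    """
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        # Reflection coefficient for stage i.
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, r, err


def analyze_speech(speech, order=10, frame_len=256, overlap=192,
                   preemphasis=0.9, n_frames=100):
    """Return one (a, r, err) tuple per Hamming-windowed frame."""
    x = np.asarray(speech, dtype=float)
    # Assumed preemphasis: y[n] = x[n] - 0.9 * x[n-1].
    x = np.append(x[0], x[1:] - preemphasis * x[:-1])
    hop = frame_len - overlap  # 64 samples for a 3/4 frame overlap
    window = np.hamming(frame_len)
    results = []
    for m in range(n_frames):
        frame = x[m * hop:m * hop + frame_len]
        if len(frame) < frame_len:  # ran out of data
            break
        results.append(lpc_autocorrelation(frame * window, order))
    return results
```

With these settings, 100 consecutive frames span 256 + 99 x 64 = 6592
samples of the selected stable segment.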

The average distortion with four centroids, SD4, for the normal and
pathological groups was examined for further analysis. The average
distortion with one centroid, SD1, represents the overall spectral
variations for the given speech samples. We also examined SD1 for
both groups to compare it with SD4.
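
How SD1 and SD4 arise from one training run can be illustrated with
a small codebook-training sketch in Python. Several points are
assumptions made only for illustration: the squared Euclidean
distance stands in for the Itakura-Saito distortion of Chapter 3, the
codebook is grown by splitting each centroid with a small additive
perturbation, and an iteration stops when the relative decrease of
the average distortion falls below the threshold. The function names
are hypothetical.

```python
import numpy as np


def average_distortion(vectors, codebook):
    """Mean squared Euclidean distance to the nearest codeword."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    return d[np.arange(len(vectors)), labels].mean(), labels


def train_codebook(vectors, max_size=4, threshold=1e-4,
                   perturb=0.01, max_iter=100):
    """Grow a codebook 1 -> 2 -> 4 ... by splitting and re-clustering.

    Returns the final codebook and a dict mapping each codebook size
    to its average distortion (history[1] plays the role of SD1 and
    history[4] the role of SD4).
    """
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors.mean(axis=0, keepdims=True)
    history = {1: average_distortion(vectors, codebook)[0]}
    delta = perturb * vectors.std(axis=0)  # assumed splitting offset
    while len(codebook) < max_size:
        codebook = np.vstack([codebook + delta, codebook - delta])
        prev = np.inf
        for _ in range(max_iter):
            dist, labels = average_distortion(vectors, codebook)
            if prev - dist <= threshold * dist:  # assumed stopping rule
                break
            prev = dist
            for c in range(len(codebook)):
                members = vectors[labels == c]
                if len(members):  # keep the old codeword for empty cells
                    codebook[c] = members.mean(axis=0)
        history[len(codebook)] = average_distortion(vectors, codebook)[0]
    return codebook, history
```

Replacing the distance inside `average_distortion` and the centroid
update with their Itakura-Saito counterparts would be required before
comparing numbers with those reported in this chapter.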

Pitch synchronous LPC analysis using the covariance method.
The LPC coefficients obtained by the pitch synchronous analysis are
parametric representations of the speech signal on a pitch period
basis. Therefore, they are thought to contain more information about
cycle-by-cycle variations in the speech signal than those of the
pitch asynchronous analysis. We did the pitch synchronous analysis
using the covariance method for the same speech samples used for the
pitch asynchronous LPC analysis with the following conditions:

Filter order                           : 10 coefficients
Analysis frame                         : 1 pitch period
Frame overlap                          : none
Analysis window                        : Hamming window
Speech preemphasis factor              : 0.9
Data set for coefficient calculations  : 100 frames

Because we fixed the number of pitch periods to be analyzed for
all the subjects, the analyzed interval varied depending upon each
subject's pitch period. We used the differentiated EGG signal for
the pitch period detection. For several pathological subjects whose
EGG waveforms were too irregular to obtain 100 pitch periods from
the given data interval, we did the LPC analysis with as many pitch
periods as possible from the given data interval. The procedure to
get the average spectral distortion is the same as that shown in
Figure 5-1. We computed 100 sets of correlation terms and LPC
coefficients first. Then we computed the average spectral
distortion, defined in (3.20) and (3.18), using the VQ algorithm
with the same conditions specified before. For the normalization of
correlation terms across subjects, the speech signal was normalized
using the squared energy, after removing the mean, on a pitch period
basis.

EGG  Signal  Analysis 

Perturbation  analysis  of  the  pitch  period  and  amplitude  of  the 
voiced  speech  signal  is  a common  technique  for  the  detection  of 
laryngeal  disorders  [Lieberman,  1961  and  1963;  Koike,  1969;  Davis, 
1976].  Since  the  EGG  waveform  is  less  complex  than  the  voiced 
speech  signal  and  is  unaffected  by  the  acoustic  resonance  of  the 
vocal  tract,  it  is  thought  to  be  more  advantageous  than  the  speech 
signal  for  perturbation  analysis. 

In  an  attempt  to  determine  the  EGG  features  that  might  detect 
laryngeal  pathologies,  we  analyzed  the  EGG  waveform  in  detail  in  the 
time  domain.  First,  by  visual  inspection,  we  selected  approximately 
100  cycles  of  stable  EGG  signal,  and  measured  its  time  and  amplitude 
values at various points as shown in Figure 5-2. The algorithm
developed to measure these values is given in Appendix B.

Time interval measurement. From the measurements shown in
Figure 5-2, we defined 8 parameters representing different intervals
that are related to the opening phase and closing phase of the vocal
folds' vibratory pattern. These are given as follows.

LOT = IP4 - IP2
HIT = IP7 - IP5
RIT = IP5 - IP4
FLT = I2 - IP7
RPP = IP6 - IP3
FPP = I3 - IP6
TOP = IP7 - IP4
TCL = I4 - IP7

Figure 5-2. Various measurements of time and amplitude of
the EGG signal

Amplitude difference measurement. We also defined the
following parameters representing amplitude differences at various
points:

PMD = DX(IP1)
MD  = DX(I1)
PCL = X(IP1) - X(IP3)
CCL = X(I1) - X(I3)
PRA = X(IP6) - X(IP3)
CRA = X(I6) - X(I3)
PHC = X(IP6) - X(I1)
CHC = X(I6) - X(IF1)
PFA = X(IP6) - X(I3)

where X(n) and DX(n) correspond to the EGG signal and differentiated
EGG signal, respectively.

Various  combinations  of  these  parameters  were  examined  and 
converted  to  ratios  that  served  as  the  basis  of  comparison  between 
subjects  for  the  detection  of  a laryngeal  pathology.  To  compare  our 
analysis  methods  with  the  simple  pitch  period  and  amplitude 
perturbation  analysis  method,  we  measured  the  pitch  period  and 
amplitude  sequences  from  the  EGG  signal  as  follows. 


pitch period = I1 - IP1

pitch amplitude = X(IP6) - X(IP3)
Detection Tests
Feature effectiveness

The histogram is a graph of the percent occurrence of values of
a feature as a function of sequential intervals that span a
specified range of values for the feature. The graph may be thought
of as a discrete approximation to the probability distribution of
the variable. To examine the effectiveness of each feature for the
discrimination of normal and pathological groups, we plotted the
histograms of each feature in each group, and calculated the mean
and standard deviation for the feature being analyzed. These data
were then graphed and compared to a univariate Gaussian probability
density function superimposed on the histogram.
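
A percent-occurrence histogram and the Gaussian curve laid over it
can be sketched in Python with NumPy as follows. The function names
and the choice of scaling are illustrative assumptions: the Gaussian
density is evaluated at each bin center and multiplied by the bin
width so that it is expressed in the same percent units as the
histogram.

```python
import numpy as np


def percent_histogram(values, bins=10, value_range=None):
    """Percent of the counted values falling into each interval."""
    counts, edges = np.histogram(values, bins=bins, range=value_range)
    percent = 100.0 * counts / counts.sum()
    return percent, edges


def gaussian_percent(edges, mean, std):
    """Gaussian density at the bin centers, scaled to percent per bin."""
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = np.diff(edges)
    pdf = np.exp(-0.5 * ((centers - mean) / std) ** 2)
    pdf /= std * np.sqrt(2.0 * np.pi)
    return 100.0 * pdf * width
```

For one feature of one group, `percent_histogram` gives the bars and
`gaussian_percent`, called with that group's sample mean and standard
deviation, gives the superimposed curve.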

Features may also be evaluated in pairs by determining the
correlation between two features. A correlation of one means
that the two features are linearly dependent. Negative correlation
means  the  two  features  vary  in  opposition.  A correlation  of  zero 
implies  that  the  two  features  are  not  correlated,  and  if  their 
multivariate  probability  density  is  truly  Gaussian  then  the  features 
are  statistically  independent.  If  two  features  are  significantly 
correlated,  then  there  may  be  a common  underlying  acoustical  or 
physiological  link  between  them.  We  calculated  the  correlation 
matrices  for  the  features  in  each  group. 
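
The per-group statistics described here can be computed in a few
lines of Python with NumPy. This is a minimal sketch with a
hypothetical function name; the use of the sample standard deviation
(divisor N - 1) is an assumption, since the text does not state which
estimator was used.

```python
import numpy as np


def group_statistics(features):
    """Mean, standard deviation, and correlation matrix of one group.

    `features` holds one row per subject and one column per feature.
    """
    x = np.asarray(features, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)  # assumed: sample standard deviation
    corr = np.corrcoef(x, rowvar=False)
    return mean, std, corr
```

Calling the function once on the normal group and once on the
pathological group yields two correlation matrices that can be
compared entry by entry.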

Classification  test 

One of the basic problems addressed by this research was the
classification of a subject as being normal or pathological on the
basis of features extracted from the speech and EGG signals. The
performance of a classifier is evaluated from theoretical or
empirical calculations of the error probability. Theoretical error
probabilities are computationally complex, usually involving
integrals of the distributions over given regions in a hyperspace.
Empirical calculations of error probabilities yield good performance
estimates and are less complex in nature.

There are two types of error probabilities which are measured
in a classification test. The first kind (Type I) is called the
probability of false alarm (false acceptance), that is, the
probability of identifying a pathology as present when there is
none. The second kind (Type II) is called the probability of miss
(false rejection), that is, the probability of not identifying a
pathology when it really is present. More frequently, however, the
quantity one minus the probability of miss, which is called the
probability of detection (correct acceptance), is used [Whalen,
1971].

We designed two methods for the classification test. One is a
closed threshold test, and the other is the discriminant analysis
with a "leave-one-out" method. In a closed threshold test, all the
data sets are used to determine the threshold and then they are used
again to test the classifier. This method provides a lower bound on
the error probability for future performance of the classifier. The
decision threshold can be calculated using various criteria such as
the Neyman-Pearson criterion and the Bayes criterion [Tou and
Gonzales, 1974]. In practice, a good procedure is to determine
empirically the detection probability and the false alarm
probability for each specified threshold. If the threshold is varied
over a suitable range and the results tabulated, any given false
alarm probability may be realized with a corresponding change in the
detection probability [Davis, 1976]. We determined the threshold for
each feature using this approach, and the probability of false alarm
was set at less than 10%.
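
The empirical threshold search described above can be sketched in
Python with NumPy. The sketch assumes that a subject is called
pathological when the feature value exceeds the threshold and that,
among the thresholds meeting the false alarm limit, the one with the
highest detection probability is kept; the function names are
hypothetical.

```python
import numpy as np


def sweep_thresholds(normal, pathological, thresholds):
    """Tabulate (threshold, false alarm %, detection %) for each value."""
    normal = np.asarray(normal, dtype=float)
    pathological = np.asarray(pathological, dtype=float)
    rows = []
    for t in thresholds:
        p_false_alarm = 100.0 * np.mean(normal > t)
        p_detection = 100.0 * np.mean(pathological > t)
        rows.append((t, p_false_alarm, p_detection))
    return rows


def pick_threshold(rows, max_false_alarm=10.0):
    """Best detection among rows whose false alarm is below the limit."""
    allowed = [row for row in rows if row[1] < max_false_alarm]
    if not allowed:
        return None
    return max(allowed, key=lambda row: row[2])
```

Because the same subjects are used to choose and to test the
threshold, the rates obtained this way correspond to the closed
threshold test rather than to an estimate of future performance.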

The  discriminant  analysis  procedure  attempts  to  distinguish 
between  two  or  more  groups  on  the  basis  of  a set  of  measures  for 
which  the  groups  are  expected  to  differ.  These  discriminating 
variables  are  weighted  and  taken  in  various  linear  combinations  so 
that  the  groups  are  forced  to  be  as  distinct  as  possible.  The  form 
of  these  linear  combinations  of  variables  is 

D_i = d_{i1} z_1 + d_{i2} z_2 + \cdots + d_{im} z_m ,    (5.2)

where D_i is the score on the discriminant function i, the d's are
the weighting factors, the z's are the values of the discriminating
variables, and m is the number of variables used in the analysis.

In order to do the discriminant analysis, we need both a Design
Set and a Test Set. The discriminant functions are calculated from
the Design Set. Once the discriminant functions have been
generated, the classification is done by assigning the test input
from the Test Set to one of the groups, based on the scores obtained
from the input. The "leave-one-out" method uses all the data sets
except one to determine the discriminant function. The single data
set held out becomes the Test Set for the classification test. This
procedure is repeated with all the data sets being held out in
rotation, and the empirical error probabilities are averaged. Duda
and Hart [1973] suggest that the "leave-one-out" method is a good
way to determine the error probability within a reasonable
confidence level. We used the DMATX, MINV, and DISCR subroutines
available in IBM's Scientific Subroutine Package to determine the
discriminant function.
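
The rotation of the held-out subject can be sketched in Python with
NumPy. This sketch does not reproduce the DMATX, MINV, and DISCR
subroutines; it assumes a two-group linear discriminant built from
the pooled covariance matrix with equal prior probabilities, and the
function names are hypothetical.

```python
import numpy as np


def fit_discriminant(x0, x1):
    """Weights and bias of a two-group linear discriminant."""
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    s0 = np.atleast_2d(np.cov(x0, rowvar=False)) * (len(x0) - 1)
    s1 = np.atleast_2d(np.cov(x1, rowvar=False)) * (len(x1) - 1)
    pooled = (s0 + s1) / (len(x0) + len(x1) - 2)
    w = np.linalg.solve(pooled, m1 - m0)
    bias = -0.5 * np.dot(w, m0 + m1)  # assumed: equal priors
    return w, bias


def leave_one_out(x, y):
    """Confusion counts; rows are true groups, columns assigned groups.

    `y` is 0 for normal and 1 for pathological subjects.
    """
    y = np.asarray(y, dtype=int)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    confusion = np.zeros((2, 2), dtype=int)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i  # Design Set: everyone but i
        xd, yd = x[keep], y[keep]
        w, bias = fit_discriminant(xd[yd == 0], xd[yd == 1])
        assigned = int(np.dot(x[i], w) + bias > 0.0)  # Test Set: subject i
        confusion[y[i], assigned] += 1
    return confusion
```

Dividing each row of the returned counts by its row total gives the
"From / To" proportions in the form used for the classification
tables in this chapter.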

Voice  profile  analysis 

We examined the speech and EGG waveforms visually, together
with the perceptual judgements made by five speech specialists, for
the subjects who were misclassified in the closed threshold test.
Among the pathological subjects in our data base, listed in
Table 4-1, a subject whose second initial is given as a number is a
patient undergoing treatment of a laryngeal pathology. For example,
the numbers "2" and "3" represent the second and third visits,
respectively. We also examined the data from these patients.


Results  and  Discussion 
Acoustic  Signal  Analysis 

Feature effectiveness. The histograms for the two features,
SD1 and SD4, obtained from the pitch asynchronous LPC analysis and
the pitch synchronous LPC analysis are shown for each of the two
groups in Figures 5-3 and 5-4. The Gaussian probability
distributions are overlaid on each histogram. The means and
standard deviations of the two features are given in Tables 5-1 and
5-2. From Figures 5-3 and 5-4, we can see that feature SD4 gives
better discrimination between normal and pathological groups than
feature SD1.

Figure 5-3. Histograms of SD1 and SD4 for normal and pathological
groups (Pitch asynchronous LPC analysis)
Normal group : 52 subjects
Pathological group : 29 subjects

Figure 5-4. Histograms of SD1 and SD4 for normal and pathological
groups (Pitch synchronous LPC analysis)
Normal group : 52 subjects
Pathological group : 29 subjects

Classification  test.  Using  the  feature  SD4,  LPC  distortion 
with  4 centroids,  we  examined  how  changes  in  the  threshold  for  the 
normal/pathological  decision  affect  the  detection  probability. 
Figure  5-5  shows  graphs  of  the  detection  probability  as  a function 
of  threshold.  The  probability  of  false  alarm  (false  acceptance)  may 
be  determined  from  one  minus  the  probability  of  detection  for  the 
normal  group.  The  probability  of  detection  is  the  probability  of 
correct  acceptance  for  the  pathological  group.  All  probabilities 
are  expressed  as  percent.  Therefore,  for  a given  false  alarm 
probability,  the  threshold  and  corresponding  detection  probability 
may  be  determined. 

When we set the probability of false alarm at less than 10%,
the test results showed 75.9% (22 of 29 subjects) correct detection
of pathological subjects with a false alarm probability of 9.6%
(5 of 52 subjects) for the pitch asynchronous LPC analysis method,
and 44.8% (13 of 29 subjects) correct detection of pathological
subjects with a false alarm probability of 9.6% (5 of 52 subjects)
for the pitch synchronous analysis method. The corresponding
thresholds are 0.005 for the pitch asynchronous analysis, and
0.0005 for the pitch synchronous analysis.

Figure 5-5. Detection probability as a function of threshold
(a) Pitch asynchronous LPC analysis
(b) Pitch synchronous LPC analysis

Table 5-1. Statistics of SD1 and SD4
(Pitch asynchronous LPC analysis)
Normal group : 52 subjects
Pathological group : 29 subjects

                        SD1                      SD4
Groups          Mean       St. Dev.      Mean       St. Dev.
Normal          0.00829    0.006343      0.003001   0.002329
Pathological    0.02028    0.029460      0.008097   0.006233

Table 5-2. Statistics of SD1 and SD4
(Pitch synchronous LPC analysis)

                        SD1                      SD4
Groups          Mean       St. Dev.      Mean       St. Dev.
Normal          0.000616   0.000492      0.000243   0.000201
Pathological    0.002788   0.007282      0.001205   0.002847

If we allow a higher false alarm probability in order to increase
the detection probability, the pitch asynchronous analysis gives
82.6% correct detection of pathological subjects with a false alarm
probability of 13.6%, while the pitch synchronous analysis gives
72.4% correct detection of pathological subjects with a false alarm
probability of 26.9%. The results of the closed threshold test and
the discriminant analysis with a "leave-one-out" method are given in
Tables 5-3 and 5-4.

Discussion. The LPC coefficients have been used directly for the
evaluation of laryngeal function by Deller and Anderson [1980] and
Smith [1980]. They examined the placement of the zeros of the
inverse filter (the polynomial in the z-domain formed from the LPC
coefficients) over consecutive segments of the speech signal. Most
other researchers, however, used the LPC technique to obtain the
residual signal by inverse filtering the speech signal [Koike and
Markel, 1975; Davis, 1976; Prosek et al., 1987; Muta et al., 1987].

Deller and Anderson [1980] demonstrated the ability of the LPC
technique to detect laryngeal dysfunction from simulated anomalous
synthetic speech signals. Smith [1980] did similar work for real
speech signals. However, his test results for laryngeal pathology
detection were poor. This is thought to be because his analysis
method did not consider the perturbations of the zeros of the
inverse filter that reflect the laryngeal dysfunction in the speech
signal.
With the VQ technique, we could effectively compute the
perturbations of the given sets of the LPC coefficients (i.e., the
LPC spectral distortion measure) in the frequency domain. The
results of the classification tests for our data base showed that
the LPC spectral distortion measure has potential for the
evaluation of laryngeal function.

Visual inspection of the histograms shows that the two features
(SD1 and SD4) of the normal group fit the Gaussian distribution
fairly well, whereas those of the pathological group do not. This
is thought to be because our pathological data base does not
contain enough representative cases of voice disorders and the size
of the data base is small.

We  had  expected  that  the  pitch  synchronous  analysis  would  give 
better  results  than  the  pitch  asynchronous  analysis  in  the 
discrimination  of  the  normal  and  pathological  groups.  Contrary  to 
our  expectation,  however,  Tables  5-3  and  5-4  show  that  the  pitch 
synchronous  analysis  result  is  not  as  good  (or  separable)  as  the 
pitch  asynchronous  analysis. 

The pitch synchronous LPC analysis using the covariance method
uses a very small number of speech samples compared with the pitch
asynchronous LPC analysis by the autocorrelation method.
Nevertheless, the spectrum of the pitch synchronous LPC analysis has
a tendency to show more detailed formant structures than that of the
pitch asynchronous LPC analysis, as shown in Figure 5-6. Thus, a
small variation in the pitch period of normal subjects could result
in relatively large distortion in the frequency range of higher
formants. This might explain why the pitch synchronous analysis
resulted in poor discrimination of normal and pathological subjects
compared to the pitch asynchronous analysis.

Figure 5-6. Examples of the LPC spectra of the sustained vowel /i/
for subject BKS (5 spectra are superimposed in each graph; filter
order = 10)
(a) Pitch asynchronous analysis using the autocorrelation
method with Hamming window
(b) Pitch synchronous analysis using the covariance method
with Hamming window

Table 5-3. Results of the classification test
(Closed threshold test with single feature SD4)
Normal group : 52 subjects
Pathological group : 29 subjects

Pitch asynchronous LPC distortion analysis

From \ To       Normal     Pathological
Normal          47/52      5/52
Pathological    7/29       22/29

Pitch synchronous LPC distortion analysis

From \ To       Normal     Pathological
Normal          47/52      5/52
Pathological    16/29      13/29

Table 5-4. Results of the classification test
(Discriminant analysis with a "leave-one-out" method)

Pitch asynchronous LPC distortion analysis

From \ To       Normal     Pathological
Normal          48/52      4/52
Pathological    9/29       20/29

Pitch synchronous LPC distortion analysis

From \ To       Normal     Pathological
Normal          49/52      3/52
Pathological    19/29      10/29

EGG  Signal  Analysis 

Feature effectiveness. The measured values of the 8 time interval
parameters for a typical normal subject and a typical pathological
subject are shown in Figure 5-7. In these graphs, the horizontal
axis reflects the different interval measurements, while the
vertical axis indicates the measured values in number of samples.
Each plot represents the measurements for each of the 100 cycles, so
each dot represents one measurement within a cycle. The values of
these parameters vary depending on each subject's pitch period.
Therefore, we converted these parameters into the ratios LOT/HIT,
FLT/RIT, FPP/RPP, and TCL/TOP. The graphs of these 4 parameters for
the same subjects shown in Figure 5-7 are given in Figure 5-8.

Similarly, we selected 4 new parameters, MD/PMD, PFA/PRA,
CHC/PHC, and MD/PFA, after inspecting various combinations of the 9
amplitude difference parameters. The graphs of these 4 parameters
for a typical normal subject and a typical pathological subject are
shown in Figure 5-9.

After examining the average, standard deviation, and dispersion
of the newly defined 8 parameters (4 parameters for the ratios of
time interval measurements and 4 parameters for the ratios of
amplitude difference measurements), we selected the following 9
features as being more useful than the others.

AVT1 : average of FLT/RIT
AVT2 : average of FPP/RPP
AVT3 : average of TCL/TOP

DPT1 : dispersion of FLT/RIT
DPT2 : dispersion of FPP/RPP
DPT3 : dispersion of TCL/TOP

DPA1 : dispersion of MD/PMD
DPA2 : dispersion of PFA/PRA
DPA3 : dispersion of CHC/PHC

Figure 5-7. Plot of 8 parameters of time interval measurements
(a) Normal subject : DMH
(b) Pathological subject : DMH17

Figure 5-8. Plot of 4 parameters defined by the ratio of time
interval measurements
(a) Normal subject : DMH
(b) Pathological subject : DMH17

Figure 5-9. Plot of 4 parameters defined by the ratio of amplitude
difference measurements
(a) Normal subject : MBK
(b) Pathological subject : JCY

The  first  three  features  represent  the  average  value  of  time 
interval  ratios  related  to  the  vocal  folds'  closing  to  opening 
phase.  The  remaining  six  features  represent  the  amount  of 
perturbations  in  both  time  and  amplitude  of  the  EGG  signal.  The 
means,  standard  deviations,  and  correlations  of  these  features  for 
normal  and  pathological  groups  are  given  in  Table  5-5. 
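
Forming one average feature and one dispersion feature from a pair
of per-cycle measurement sequences can be sketched in Python with
NumPy. The definition of dispersion used here (sample standard
deviation of the ratio divided by the magnitude of its mean) is an
assumption made for illustration only, since this chapter does not
give the formula; the function name is hypothetical.

```python
import numpy as np


def ratio_features(numerator, denominator):
    """Average and dispersion of a per-cycle ratio such as FLT/RIT.

    Both inputs hold one measurement per EGG cycle.
    """
    ratio = np.asarray(numerator, dtype=float) / np.asarray(denominator, dtype=float)
    average = ratio.mean()
    # Assumed definition of dispersion: standard deviation over mean.
    dispersion = ratio.std(ddof=1) / abs(average)
    return average, dispersion
```

For example, passing the FLT and RIT sequences of one subject would
give candidate values for AVT1 and DPT1 under this assumed
definition.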

Classification test. We determined each feature's threshold
with a false alarm probability of less than 10% for the detection of
voice disorders. We divided the 9 features into 3 groups,
representing the average time interval measurement, the variation in
the time interval measurement, and the variation in the amplitude
difference measurement. Then, for each subject, we counted the
number of features in each group whose values exceeded the
thresholds. Table 5-6 shows the results for all the pathological
subjects and some normal subjects who were misclassified in the
test. The classification results of the closed threshold test,
obtained from Table 5-6, showed 69.0% (20 of 29 subjects) correct
detection of the pathological subjects with a false alarm
probability of 9.6%. The classification results with the closed
threshold test and the discriminant analysis with a "leave-one-out"
method are shown in Table 5-7. In the discriminant analysis, we
used three grouped features combined with appropriate weightings.
From the pitch period and amplitude measurements of the EGG signal,
we did the same classification tests, and the results are shown in
Table 5-8.

Table 5-5. Statistics of selected features from the EGG signal
(a) Normal group

           DPT1     DPT2     DPT3     DPA1     DPA2     DPA3
Mean       0.10791  0.08969  0.06105  0.06758  0.02147  0.03732
St. Dev.   0.05467  0.04494  0.04979  0.01713  0.03178  0.01589

           AVT1     AVT2     AVT3
Mean       0.45225  0.61680  0.53040
St. Dev.   0.61388  0.36201  0.37884

Correlation matrix

        DPT1    DPT2    DPT3    DPA1    DPA2    DPA3    AVT1    AVT2    AVT3
DPT1    1.0
DPT2    0.393   1.0
DPT3    0.776   0.346   1.0
DPA1    0.471   0.323   0.427   1.0
DPA2    0.208   0.061   0.273   0.228   1.0
DPA3    0.664   0.536   0.676   0.654   0.357   1.0
AVT1    0.089  -0.121   0.239   0.086   0.857   0.146   1.0
AVT2    0.196  -0.130   0.344   0.157   0.693   0.187   0.906   1.0
AVT3    0.014  -0.218   0.152   0.008   0.827   0.022   0.959   0.815   1.0

Table 5-5 continued
(b) Pathological group

           DPT1     DPT2     DPT3     DPA1     DPA2     DPA3
Mean       0.27510  0.22571  0.20063  0.15919  0.07730  0.11572
St. Dev.   0.23476  0.19851  0.17909  0.11921  0.08307  0.10554

           AVT1     AVT2     AVT3
Mean       1.02920  1.06320  0.81980
St. Dev.   1.03360  0.70870  0.56543

Correlation matrix

        DPT1    DPT2    DPT3    DPA1    DPA2    DPA3    AVT1    AVT2    AVT3
DPT1    1.0
DPT2    0.965   1.0
DPT3    0.866   0.876   1.0
DPA1    0.498   0.482   0.544   1.0
DPA2    0.483   0.479   0.520   0.821   1.0
DPA3    0.258   0.274   0.339   0.414   0.533   1.0
AVT1    0.097   0.088   0.162   0.027  -0.065  -0.054   1.0
AVT2   -0.009  -0.016   0.098  -0.039  -0.140  -0.119   0.974   1.0
AVT3    0.203   0.186   0.246   0.059   0.010   0.032   0.903   0.851   1.0

Table 5-6. Results of EGG analysis with three grouped features
(Number of features exceeding the threshold)

Subject        AVT   DPT   DPA
DMH3   (M)      3     3     3
DMH10  (M)      3     0     2
DMH17  (M)      1     3     3
DMH31  (M)      3     3     3
GPM3   (M)      3     0     0
GPM10  (M)      3     3     3
ABC    (F)      2     0     0
AHZ    (F)      0     3     3
CWG    (F)      3     2     3
C2G    (F)      3     2     3
C3G    (F)      3     3     3
CLW    (F)      0     3     3
EDR    (M)      0     0     0
E2R    (M)      0     0     0
JJS    (M)      3     1     3
JCY    (F)      0     3     3
JFH    (M)      0     0     0
JMS    (M)      3     0     2
J2S    (M)      1     2     3
JTO    (M)      0     0     0
MLM    (F)      0     0     0
MXN    (F)      0     1     3
NUK    (F)      0     3     3
N2K    (F)      0     0     0
N3K    (F)      3     3     3
SIR    (M)      0     3     3
PJB    (F)      1     3     0
P2B    (F)      0     0     0
TLW    (F)      0     0     1
CAP    (F)      1     2     0
EMD    (M)      3     0     1
JEW    (F)      1     0     3
MTS    (F)      2     2     3
SLG    (F)      2     1     1

Table 5-7. Results of the classification test from the EGG signal
analysis

Closed threshold test

From \ To       Normal     Pathological
Normal          47/52      5/52
Pathological    9/29       20/29

Discriminant analysis with a "leave-one-out" method

From \ To       Normal     Pathological
Normal          48/52      4/52
Pathological    9/29       20/29

Table 5-8. Results of classification test from the EGG signal
(Pitch period and amplitude perturbation analysis)

Closed threshold test

From \ To       Normal     Pathological
Normal          47/52      5/52
Pathological    11/29      18/29

Discriminant analysis with a "leave-one-out" method

From \ To       Normal     Pathological
Normal          49/52      3/52
Pathological    12/29      17/29

Discussion.  The  EGG  signal  reflects  the  vibratory  cycle  of  the 
vocal  folds  with  fairly  high  fidelity.  Irregularities  in  the  EGG 
signal  thus  correspond  to  irregularities  in  the  vibratory  pattern  of 
the  vocal  folds.  In  order  to  detect  the  irregularities  in  the  EGG 
signal  better,  we  did  various  measurements  of  time  and  amplitude  of 
the  EGG  signal. 

For the normal subjects, the rising time of the EGG waveform is
generally longer than the falling time. Thus, the three features
(AVT1, AVT2, and AVT3) that are related to the ratio of the falling
time to the rising time of the EGG waveform may have values less
than one for a normal subject. However, a smaller value of these
three features does not mean a better condition of the vocal folds'
vibratory pattern. On the other hand, for the features DPT1, DPT2,
DPT3, DPA1, DPA2, and DPA3, which represent the perturbations of the
EGG signal, a smaller value implies a better condition of the vocal
folds' vibratory pattern. As expected, Table 5-5 shows that the
mean and standard deviation of each feature for the pathological
group are larger than those of the normal group.

From the correlation matrices in Table 5-5, we can see that the
features DPT1, DPT2, and DPT3 show more significant positive
correlations with one another in the pathological group than in the
normal group. The correlations among the features DPA1, DPA2, and
DPA3 show a similar pattern. This implies that features such as
DPT1, DPT2, DPT3 and DPA1, DPA2, DPA3 can provide more information
about the EGG waveform perturbations for the detection of voice
disorders than the simple pitch period and amplitude perturbation
measures used by Haji et al. [1986].

Voice  Profile  Analysis 

Evaluation of misclassified subjects

So far we have analyzed the speech and EGG signals for the
evaluation of laryngeal function with the following two methods that
we developed.

Method I: Pitch asynchronous LPC spectral distortion analysis using VQ

Method II: EGG waveform perturbation analysis

We also analyzed the pitch period and amplitude perturbation from
the EGG signal to compare the results with those of our methods.

We summarized the results of the correct classification of normal
and pathological subjects in Table 5-9. We also listed the subjects
who were misclassified in the closed threshold test so that each
could be examined further using the speech and EGG waveforms as well
as perceptual judgements.

The  subject  DMH3  has  a mimicked  vocal  fry,  and  his  speech  and 
EGG  waveforms  are  shown  in  Figure  5-10.  The  speech  and  EGG 
waveforms  show  two  irregular  long  pitch  periods,  approximately  22.5 
msec  and  32.5  msec  respectively,  and  this  resulted  in  large 
perturbations  of  pitch  periods.  This  subject  was  judged  as  having  a 
severe  vocal  fry  by  listening  tests. 
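
To see why two long periods inflate a pitch perturbation measure, consider the minimal Python sketch below. It uses one common definition, the mean absolute difference between consecutive periods divided by the mean period; this definition is an assumption here and is not necessarily the measure used in this work. The 8 msec baseline period and the sequences themselves are hypothetical; only the 22.5 and 32.5 msec values come from the text.

```python
import numpy as np

def relative_perturbation(periods_ms):
    """Mean absolute difference of consecutive periods, divided by the mean period."""
    p = np.asarray(periods_ms, dtype=float)
    return np.abs(np.diff(p)).mean() / p.mean()

# Hypothetical pitch period sequences in msec. The second one contains the
# two long periods (22.5 and 32.5 msec) reported for subject DMH3.
regular = [8.0, 8.1, 7.9, 8.0, 8.1, 8.0, 7.9, 8.0]
with_long_periods = [8.0, 8.1, 22.5, 8.0, 8.1, 32.5, 7.9, 8.0]

print(relative_perturbation(regular))
print(relative_perturbation(with_long_periods))
```

Each long period contributes two large consecutive differences (one entering it and one leaving it), so even two such periods dominate the average.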
Figure 5-10. Speech and EGG waveforms of the sustained vowel /i/ for subject DMH3 (time scale bar: 100 ms)

Figure 5-11. Speech and EGG waveforms of the sustained vowel /i/ for subject JCY (time scale bar: 100 ms)
Table 5-9. Comparison of classification test results

                                Pathological    Normal
Method I
  Closed threshold test         75.9%           90.4%
  Discriminant analysis         69.0%           92.3%
  Misclassified subjects
    Pathological: DMH3, JCY, JTO, EDR, E2R, J2S, MLM
    Normal:       AAA, PXS, DCD, MJS, DAS

Method II
  Closed threshold test         69.0%           90.4%
  Discriminant analysis         69.0%           92.3%
  Misclassified subjects
    Pathological: ABC, EDR, E2R, JTO, JFH, MLM, N2K, P2B, TLW
    Normal:       CAP, EMD, JEW, MTS, SLG

Pitch perturbation analysis
  Closed threshold test         62.1%           90.4%
  Discriminant analysis         58.6%           94.2%
  Misclassified subjects
    Pathological: ABC, C2G, DMH10, EDR, E2R, JFH, JMS, MLM, N2K, P2B, TLW
    Normal:       DGC, D0B, EMD, MTS, JEW
The subject JCY was judged as having a very severe hoarse voice
by listening tests. Her speech and EGG waveforms are shown in
Figure 5-11, and they look quite stable in pitch period and
amplitude.

Both the speech and EGG waveforms of subjects JTO, N2K, JFH, MLM,
EDR, E2R, TLW, ABC, and DMH10 were very regular and stable, and
these subjects were hardly recognizable as pathological by listening
tests. For example, Figure 5-12 shows the speech and EGG waveforms
and the pitch synchronous LPC spectra for subject JTO. Both
waveforms look quite normal, and the LPC spectra also show very
normal characteristics. Every feature of this subject obtained from
methods I and II has a value less than the corresponding threshold.
We will mention other misclassified subjects later in the discussion
section.

Evaluation  of  subjects  during  therapy 

A subject code whose second character is a number denotes a
subject who is undergoing treatment of a laryngeal pathology. For
example, subject PJB is a female patient having a weak and breathy
voice before treatment, and the code P2B represents the same subject
after treatment. The analysis results for these subjects are shown
in Table 5-10.

If we examine the data for subject EDR, the feature values do
not show much difference between pre- and post-treatment. All the
feature values for this subject appeared normal. Subject PJB
shows large perturbations in both the time interval measurements of
Figure 5-12. Speech and EGG waveforms and the pitch synchronous analyzed LPC spectra for subject JTO
(a) Speech and EGG waveforms of the sustained vowel /i/ (time scale bar: 100 ms)
(b) Spectra for the 10th order LPC covariance method. The spectra for 5 LPC vectors are superimposed.
the EGG signal and the LPC spectral distortion of the speech signal
before treatment. However, these perturbations were reduced greatly
after treatment, as shown in Table 5-10. The data for subject CWG
show some improvement during the first treatment; however, they show
deterioration after the second treatment. The data for subject NUK
also show a similar pattern. Generally, the perceptual judgements
for these subjects agreed well with the analysis results.

Discussion 

In Table 5-9, we can see that our methods give 69.0 to 75.9%
correct classification of pathological subjects with 7.7 to 9.6%
misclassification of normal subjects. The superiority of our
methods to the pitch period and amplitude perturbation analysis in
detecting pathological subjects is also shown in the table.

The speech and EGG signals for subject DMH3 showed two
irregular long pitch periods. Since the pitch period is irregular
and is sometimes longer than the LPC analysis frame size, as
shown in Figure 5-10, the LPC coefficients of each frame change
substantially depending upon the position of the analysis frame.
Therefore, for the given set of LPC vectors, the distortion with
one centroid (SD1) was very large. However, since the speech
waveform shows somewhat regular patterns with two different long
pitch periods, the average distortion with four centroids (SD4)
was greatly reduced. This is thought to be the reason for the
misclassification of this subject with method I.
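
The behavior of SD1 and SD4 for such a subject can be illustrated with a toy example. The Python sketch below is only a sketch under stated assumptions: squared Euclidean distance and plain Lloyd (k-means) iterations stand in for the modified Itakura-Saito measure and the VQ design procedure used in this work, and the two-dimensional vectors are synthetic, alternating between two tight clusters to mimic LPC vectors that fall into a few recurring patterns. The names `average_distortion` and `train_codebook` are hypothetical.

```python
import numpy as np

def average_distortion(vectors, codebook):
    """Mean over vectors of the squared distance to the nearest codeword."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

def train_codebook(vectors, size, iters=20, seed=0):
    """Plain Lloyd iterations (assumed stand-in for the original VQ design)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(vectors), size=size, replace=False)
    codebook = vectors[start].copy()
    for _ in range(iters):
        d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        for k in range(size):
            members = vectors[nearest == k]
            if len(members):  # keep the old codeword if a cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook

# Synthetic feature vectors that alternate between two recurring patterns.
rng = np.random.default_rng(1)
modes = np.array([[0.0, 0.0], [1.0, 1.0]])
vectors = modes[rng.integers(0, 2, size=200)] + 0.02 * rng.standard_normal((200, 2))

sd1 = average_distortion(vectors, train_codebook(vectors, 1))
sd4 = average_distortion(vectors, train_codebook(vectors, 4))
print(sd1, sd4)
```

With a single centroid, every vector is far from the overall mean, so the distortion is large. With four centroids, each recurring pattern gets its own codeword and the average distortion collapses, even though the underlying signal is far from normal.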
Table 5-10. Analysis results for the subjects undergoing vocal treatment

Method I

Subject    SD1       SD4
EDR        0.0091    0.0047
E2R        0.0081    0.0037
PJB        0.0313    0.0151
P2B        0.0156    0.0087
CVG        0.0353    0.0118
C2G        0.0195    0.0086
C3G        0.0152    0.0087
NUK        0.0282    0.0124
N2K        0.0157    0.0064
N3K        0.0090    0.0052

Method II

Subject    AVT1     AVT2     AVT3     DPT1     DPT2     DPT3     DPA1     DPA2     DPA3
EDR        0.297    0.607    0.452    0.088    0.052    0.054    0.081    0.008    0.035
E2R        0.588    0.823    0.637    0.078    0.047    0.024    0.069    0.008    0.029
PJB        0.726    0.869    0.719    0.739    0.765    0.675    0.067    0.018    0.043
P2B        0.280    0.925    0.372    0.169    0.140    0.060    0.082    0.016    0.045
CVG        1.932    1.517    1.298    0.142    0.153    0.155    0.135    0.085    0.126
C2G        1.726    1.605    1.161    0.199    0.114    0.254    0.108    0.075    0.143
C3G        2.518    2.281    1.836    0.304    0.166    0.331    0.133    0.101    0.129
NUK        0.411    0.567    0.479    0.301    0.356    0.274    0.169    0.138    0.405
N2K        0.270    0.492    0.388    0.165    0.080    0.055    0.032    0.013    0.021
N3K        1.218    1.177    1.229    0.424    0.361    0.315    0.238    0.242    0.172
The misclassification of subject JCY with method I can be
explained in two ways. One possible explanation is that, with
deviant phonatory production, there is no consistent instability of
the vocal folds' vibration. Instead, the vibratory behavior seems to
change from cycles of normal vibration to cycles of instability with
no predictable pattern. This was observed in the EGG signal analysis
of this subject. While the perceptual judgement was made by
listening to the entire data file, we analyzed only the most stable
portion of the data file, determined by visual inspection of the
speech and EGG waveforms. As shown in Figure 5-11, both the speech
and EGG waveforms in that portion show great regularity. The other
explanation is that, in method I, each subject's average LPC
distortion is computed with respect to its own reference (or
centroid). Therefore, even if the spectrum of the given speech shows
poor formant characteristics, the average LPC distortion becomes
very small as long as the waveform is regular.

Among the normal subjects misclassified with method II,
subjects EMD and MTS showed very regular and stable speech
waveforms; however, they have very unusual EGG waveforms, as shown
in Figure 5-13. At present, we are unable to determine whether this
is due to an artifact in the EGG recordings or to some abnormality
in the vocal folds' vibratory pattern.

The speech and EGG waveforms of subject NUK during therapy are
shown in Figure 5-14. The EGG waveforms of NUK and N3K show severe
irregularities, but the speech waveforms are quite regular compared
with the EGG waveforms. Both the speech and EGG waveforms of N2K are very
Figure 5-13. Speech and EGG waveforms of the sustained vowel /i/ for subjects EMD and MTS (time scale bar: 100 ms)
(a) Male subject: EMD
(b) Female subject: MTS

Figure 5-14. Speech and EGG waveforms for the sustained vowel /i/ for subject NUK during therapy (time scale bar: 100 msec)
regular and stable. From the visual inspection of the speech and EGG
waveforms, we can see that this subject improved greatly after the
first treatment, but that the voice then deteriorated again and in
some respects became worse than before. In Table 5-10, method I
shows that subject NUK improved slightly after the second treatment.
This contradicts the perceptual judgements as well as the results of
method II and the visual inspection of the speech and EGG waveforms.
Previously we mentioned the problem of method I with a severe hoarse
voice, as in the case of subject JCY. The same explanation applies
to subject NUK.


CHAPTER  SIX 


SPEAKER IDENTIFICATION BY VOICE

Introduction

Speaker  identification  based  on  the  analysis  of  spoken 
utterances  is  an  important  aspect  in  developing  man-machine 
interaction  systems  using  voice.  The  usefulness  of  identifying  a 
person  from  the  characteristics  of  his/her  voice  is  increasing  with 
the  growing  importance  of  automatic  information  processing  and 
telecommunications  [Doddington,  1985]. 

The  objective  of  a speaker  identification  system  is  to 
determine  the  identity  of  the  person  by  his/her  voice  from  among  a 
known  population.  The  pitch  synchronous  LPC  analysis  provides  a 
good  parametric  representation  for  the  very  short  segment  of  vowel 
phonations.  Therefore,  the  pitch  synchronous  LPC  vectors  for  the 
sustained  vowel  phonation  were  used  as  feature  vectors.  As  an 
efficient  way  of  characterizing  the  speaker-specific  features,  we 
used the speaker-based VQ codebook approach [Soong et al., 1987].

This chapter is organized in the following manner. We first
describe the experimental procedures used to design and analyze the
speaker identification scheme. The data base for the speaker
identification experiment is also described. Then we present the
analysis and identification test results with a discussion of our
findings.
Experimental Procedure

Data  Base 

Over a period of one month, we collected the speech and EGG
signals for the sustained vowel from 40 speakers (20 males and 20
females) in four different recording sessions. The 2nd, 3rd, and
4th recordings were made two days, one week, and about one month
after the first recording, respectively. The list of speakers with
their assigned numbers is given in Table 6-1. In each recording
session, the speaker was asked to phonate the vowel /i/ in the word
"beet" two times. The 8 utterances from each speaker were then
split into two parts. From the first 4 utterances, we obtained 400
sets of pitch synchronous LPC coefficients and correlation terms
(100 sets from each utterance), which were used for training
(VQ codebook generation). From the remaining 4 utterances, we
obtained 200 sets of pitch synchronous LPC coefficients and
correlation terms (50 sets from each utterance) for testing.

Speaker  Identification  Scheme 

The block diagram of the speaker identification scheme based on
the VQ codebook approach is shown in Figure 6-1. An input vector
consists of a pitch synchronous LPC vector, x, and the matrix of
correlation terms, Cx, associated with the LPC vector x. In Figure
6-1, to measure the similarity between the test input and each of
the codebooks of the N known speakers, the distortion for each
speaker, D_i, is computed by
Figure 6-1. Block diagram of proposed speaker identification scheme
[Flowchart: set i = 1 and NR_k = 0 for 1 <= k <= N. For each input vector (x_i, Cx_i), 1 <= i <= M, compute the distortions D_1, ..., D_N against the VQ codebooks of speakers #1 to #N, find k = argmin D_j over 1 <= j <= N, and set NR_k = NR_k + 1. If i < M, set i = i + 1 and repeat. Otherwise find m = argmax NR_k over 1 <= k <= N; the identified speaker is speaker #m.]
Table 6-1. List of male and female groups used for speaker identification experiment

No.    Male group    No.    Female group
1.     ACT           21.    USA
2.     BKS           22.    CLA
3.     DMH           23.    TLM
4.     CHK           24.    HUG
5.     TYY           25.    ECM
6.     HMS           26.    ESL
7.     PSH           27.    WCG
8.     RMY           28.    WKG
9.     PJH           29.    RPS
10.    YJP           30.    GCH
11.    TYH           31.    MSM
12.    RMM           32.    PTL
13.    KJT           33.    BMH
14.    CHG           34.    SMJ
15.    GMF           35.    SXP
16.    DEM           36.    KMH
17.    CSJ           37.    AHS
18.    AHY           38.    HMJ
19.    KHK           39.    PSO
20.    DIQ           40.    HSH
D_i = \min_{1 \le j \le L} d(x, y_j), \quad 1 \le i \le N \qquad (6.1)

where x is the input LPC vector, y_j is the j-th codeword of speaker
#i's codebook, L is the size of the codebook, and d(x,y) is the
modified form of the Itakura-Saito distortion measure defined in
(3.18).

Each of the given input vectors is compared with the N known
codebooks using (6.1) and is then assigned to the speaker whose
codebook gives the least distortion. Let NR_i represent the number
of input vectors assigned to speaker #i; then the speaker with the
largest value of NR becomes the identified speaker. That is, the
identification decision is given by

identified speaker # = \arg\max_{1 \le i \le N} NR_i \qquad (6.2)

where the sum of NR_i for i = 1 to N equals the number of the
given input vectors. When more than one speaker has the same largest
value of NR, the speaker who has the smallest average distortion
over all the input vectors is chosen as the identified speaker.
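
The decision rule of (6.1) and (6.2), including the tie-breaking step, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: squared Euclidean distance stands in for the modified Itakura-Saito measure of (3.18), the correlation terms Cx are ignored, and the function names `min_distortion` and `identify_speaker` as well as the toy codebooks are hypothetical.

```python
import numpy as np

def min_distortion(x, codebook):
    """Eq. (6.1) with an assumed squared-Euclidean d(x, y): nearest codeword."""
    return ((codebook - x) ** 2).sum(axis=1).min()

def identify_speaker(inputs, codebooks):
    """Return the index of the identified speaker.

    inputs:    array of shape (M, p), the test feature vectors
    codebooks: list of N arrays of shape (L, p), one codebook per speaker
    """
    # D[i, n] is the distortion of input vector i against speaker n's codebook.
    D = np.array([[min_distortion(x, cb) for cb in codebooks] for x in inputs])
    # NR: number of input vectors assigned to each speaker (Eq. 6.2 votes).
    votes = np.bincount(D.argmin(axis=1), minlength=len(codebooks))
    tied = np.flatnonzero(votes == votes.max())
    # Tie-break: smallest average distortion over all the input vectors.
    return int(tied[D[:, tied].mean(axis=0).argmin()])

# Toy codebooks (L = 2 codewords each) for two hypothetical speakers.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 0.0]]),
    np.array([[5.0, 5.0], [6.0, 5.0]]),
]
majority_case = np.array([[0.1, 0.0], [0.9, 0.1], [5.2, 5.0]])
tie_case = np.array([[0.5, 0.5], [5.0, 5.0]])

print(identify_speaker(majority_case, codebooks))
print(identify_speaker(tie_case, codebooks))
```

In the first call two of the three vectors vote for the first speaker. In the second call each speaker receives one vote, so the decision falls to the average distortion over both vectors.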

To cut the search time and computational burden of searching all
the speakers' codebooks in half, we identified the input speaker's
gender prior to determining the speaker's identity. A minimum
distance classifier using a template of the LPC spectra for each
gender, obtained from the training data set, was used to determine
the input speaker's gender. We therefore made two lists of known
speakers with their own codebooks, one for male speakers and the
other for female speakers. According to the input speaker's gender,
the corresponding gender's codebooks were searched to identify the
speaker using the procedure shown in Figure 6-1.
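
The gender pre-selection step amounts to a two-class minimum distance classifier. A minimal Python sketch is given below, assuming that each template is the average of the training vectors of one gender and that squared Euclidean distance replaces the distortion measure actually used; the names `make_templates`, `mean_distortion`, and `classify_gender` and the toy data are hypothetical.

```python
import numpy as np

def make_templates(training):
    """Average the training vectors of each gender into one template.

    training: dict mapping a gender label to an array of shape (n, p)
    """
    return {gender: vectors.mean(axis=0) for gender, vectors in training.items()}

def mean_distortion(inputs, template):
    """Average squared distance (assumed measure) from the inputs to a template."""
    return float(((inputs - template) ** 2).sum(axis=1).mean())

def classify_gender(inputs, templates):
    """Pick the gender whose template gives the minimum average distortion."""
    return min(templates, key=lambda g: mean_distortion(inputs, templates[g]))

# Toy training data for illustration only.
training = {
    "male": np.array([[0.0, 0.0], [0.2, 0.0]]),
    "female": np.array([[1.0, 1.0], [1.0, 1.2]]),
}
templates = make_templates(training)
test_inputs = np.array([[0.9, 1.0], [1.1, 1.2]])

gender = classify_gender(test_inputs, templates)
print(gender)  # only the codebooks listed under this gender are searched next
```

Only the codebook list of the selected gender is then passed to the speaker identification step, which is what halves the search.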

To  evaluate  the  effect  of  different  speaker  identification 
parameters  on  the  performance,  we  varied: 

. The size of the VQ codebook

We used codebook sizes of 1, 2, 4, and 8.

. The number of test input vectors

We performed the speaker identification tests varying the
number of test input vectors, i.e., the length of the test
speech signal in the time domain, from 10 to 50 pitch periods.

. The time span between the training and testing material

By using the first two recording sessions for training, we
examined the effect of intraspeaker variation on the
identification performance.

Results  and  Discussion 

Gender  Identification 

Figure 6-2 shows the two templates of the pitch synchronous LPC
spectra obtained from the training (VQ codebook generation) data
sets of the male and female groups, respectively. In each group,
the LPC template was obtained by averaging 8000 sets of LPC
coefficients from 20 speakers. The speaker's gender was identified
by finding the template that has the minimum distortion for the
input vectors. The results of the gender identification test are
given in Table 6-2. The correct identification rates for the
training data set and the test data set are almost the same. When
we made the lists of VQ codebooks for speaker identification, the
speakers who were troublesome in the gender identification test were
included in both the male and female groups.


Figure 6-2. Templates of the LPC spectra for the male and female group (vertical axis: amplitude in dB)
Male group: 20 speakers
Female group: 20 speakers
Table 6-2. Results of gender identification test (20 male speakers and 20 female speakers)

(a) Test results on training data set

Number of          Correct identification rate (%)
input vectors      Male      Female    Total
N = 10             95.0      98.75     96.88
N = 20             97.5      100.0     98.75
N = 30             97.5      100.0     98.75
N = 40             97.5      100.0     98.75
N = 50             96.25     100.0     98.13

(b) Test results on test data set

Number of          Correct identification rate (%)
input vectors      Male      Female    Total
N = 10             95.0      100.0     97.50
N = 20             95.0      100.0     97.50
N = 30             95.0      100.0     97.50
N = 40             95.0      100.0     97.50
N = 50             95.0      100.0     97.50
Speaker Identification

Effect  of  VQ  codebook  size.  Figure  6-3  shows  the  effect  of 
codebook  size  on  the  mean  and  standard  deviation  of  the  VQ 
distortion  obtained  from  the  training  data  set  of  40  speakers.  A 
good  separation  is  shown  between  intraspeaker  distortion  and 
interspeaker  distortion.  Intraspeaker  distortion  decreased  greatly 
when  codebook  size  increased  from  1 to  4,  while  interspeaker 
distortion decreased slightly as codebook size increased. It should
be noted that the distortion is given as a logarithmic value. The
reductions of the corresponding standard deviations are also shown
in Figure 6-3.

A further illustration of the effect of codebook size on the VQ
distortion is given in Figure 6-4. The figure shows the averages,
standard deviations, and histograms of the intraspeaker and
interspeaker distortions for codebook sizes 1 and 4. Both codebooks
give good separation between intra- and interspeaker average
distortions across all the speakers. The average distortion with
codebook size 4 gives better separation than that with codebook size
1, as is shown clearly in the histograms for both codebooks.

The speaker identification error rate is plotted as a function
of codebook size in Figure 6-5. The identification error rate
decreased greatly when the codebook size increased from 1 to 4.
However, increasing the codebook size from 4 to 8 did not reduce
the identification error rate.

Effect of test vector length. The identification error rate
versus test vector length is shown in Figure 6-6. The
Figure 6-3. Effect of the codebook size (horizontal axis: codebook size, 1 to 8; curves: interspeaker and intraspeaker)
(a) Average VQ distortion versus codebook size
(b) Standard deviation of the distortion versus codebook size

Figure 6-4. Average, standard deviation, and histogram of the VQ distortions for different codebook size (panels: codebook size = 1 and codebook size = 4; curves: interspeaker and intraspeaker)

Figure 6-5. Speaker identification error rate versus codebook size (horizontal axis: codebook size, 1 to 8)

Figure 6-6. Speaker identification error rate versus test vector length (horizontal axis: test vector length, 10 to 50)
result shows that the identification error rate decreased slightly
as the test vector length increased. However, at the test vector
length of 30, the identification error rate increased.

Effect of different recording sessions. The identification
error rate plotted as a function of the recording session number is
shown in Figure 6-7. The codebook was generated from the 200 LPC
vectors obtained from the first two recording sessions. Since the
first two sets of test vectors were obtained from the utterances
recorded in the first two sessions, they gave significantly better
results than the other two test sets. The error rate for the 4th
recording test vectors is larger than that for the 3rd recording
test vectors. Figure 6-8 shows the identification error rate versus
the total number of recording sessions used for training. The error
rate decreased greatly as more recording sessions were used for
training.

Among the 40 normal speakers listed in Table 6-1, the following
7 speakers (5 males, 2 females) had earlier speech and EGG signals
that were recorded about one and a half years before the new
recordings.

Male speakers: BKS, DMH, RMM, GMF, DEM

Female speakers: SMJ, SXP

We did the identification test with the new data sets (8 sets of
test vectors for each speaker) against the codebooks obtained by
training on the old data set. The results are shown in Table 6-3.
Figure 6-7. Speaker identification error rate versus recording session number

Figure 6-8. Speaker identification error rate versus total number of recording sessions used for training (horizontal axis: total number of recording sessions used for training, 1 to 4)
Table 6-3. Results of speaker identification test for the codebook obtained by training the old data set (Codebook size: 4)

(a) Number of test sets: 8 for each speaker

Test input    Correct    Incorrect    Error rate
BKS           8          0            0.0%
DMH           7          1            12.5%
RMM           7          1            12.5%
GMF           8          0            0.0%
DEM           5          3            37.5%
SMJ           0          8            100.0%
SXP           8          0            0.0%

Average error rate: 23.2%

(b) Number of test sets: 14 for each recording session

Test input                    Number of correct
(Recording session number)    identification       Error rate
1                             10                   28.6%
2                             12                   14.3%
3                             9                    35.7%
4                             12                   14.3%

Average error rate: 23.2%



Discussion 

We proposed a speaker identification scheme using the
speaker-based VQ codebook of the sustained vowel. With the LPC
vector of the sustained vowel as a feature vector, a codebook size
of 4 was found to be suitable to represent each speaker's feature
space. With the codebook size of 4, we achieved a correct
identification rate of 99.4% for the training data set and 89.4% for
the test data set.

Figure 6-6 shows that the length of the test speech samples has
little effect on the identification performance. Therefore, our
speaker identification scheme may be applicable to vowel samples
extracted from a running speech signal. The duration of the speech
samples used for training the VQ codebook from each utterance was
approximately 0.4 to 1.0 sec, depending on each speaker's pitch
period. The duration of the speech samples used for testing was
about 0.2 to 0.5 sec.

The experimental results for the effect of different recording
sessions, shown in Figures 6-7 and 6-8, clearly indicate that even
the sustained vowel shows large variation within a speaker over
time. This is in agreement with the results of Soong et al.
[1987], who used isolated digit utterances. Thus we need either to
update the VQ codebook or to include sufficient intraspeaker
variability in the training of the VQ codebook.

The test results for the codebook obtained by training the old
data set show that the identification error rate is highly
dependent upon the speaker.


CHAPTER  SEVEN 


SUMMARY  AND  CONCLUSION 

The  purpose  of  this  research  was  twofold.  One  was  to  develop 
quantitative  measures  from  the  speech  and  EGG  signals  for  the 
assessment  of  laryngeal  function.  This  can  provide  a means  for  the 
detection  of  a laryngeal  pathology  in  a clinical  environment,  and  it 
can  also  be  useful  for  the  quantitative  assessment  of  progress 
during  voice  treatment.  The  other  was  to  extract  features  from  the 
voiced speech signals for speaker identification. Speaker
identification from sustained vowel phonations can have many
advantages over identification using a general text. First of all,
it reduces the system complexity because neither time alignment nor
time warping is needed. The detection of the beginning and ending
points also becomes very easy with a sustained vowel phonation.

It  is  generally  believed  that  the  vocal  folds'  vibratory 
pattern  is  the  major  factor  relating  the  laryngeal  function  to  the 
sound produced [Childers et al., 1982 and 1984; Childers and
Krishnamurthy, 1985; Moore et al., 1985; Alsaka, 1987]. Thus, a
laryngeal  pathology  becomes  acoustically  perceptible  when  the 
dysfunction  is  seen  in  the  vocal  folds'  vibratory  pattern.  Two 
approaches  were  used  for  the  evaluation  of  laryngeal  function.  One 
was  analyzing  the  LPC  spectral  variations  of  the  speech  signal  using 
the  VQ  technique.  The  LPC  provides  a parametric  representation  of 
the speech waveform and can be considered to reflect changes in the
speech signal due to the laryngeal dysfunction. The other was
analyzing the perturbations of various segments and events of the
EGG signal at various positions within a cycle. The EGG signal is
known to represent the vocal folds' vibratory pattern with good
fidelity.

Two  methods  for  the  detection  of  a laryngeal  pathology  were 
developed:  1)  the  spectral  distortion  analysis  for  the  pitch 
asynchronous  LPC  vectors  using  the  VQ  technique  with  a modified  form 
of  the  Itakura-Saito  distortion  measure  and  2)  the  perturbation 
analysis  of  the  EGG  signal  with  a set  of  time  interval  and  amplitude 
difference measurements. The effectiveness of selected features for
each  method  was  investigated  using  the  histograms  of  each  feature 
and  a correlation  matrix  of  features  for  each  group. 

In a classification test for 29 pathological and 52 normal
subjects, the aforementioned methods resulted in 69.0 to 75.9%
correct classification of pathological subjects with 7.7 to 9.6%
misclassification of normal subjects. The analogous test using the
pitch period and amplitude perturbation analysis from the EGG signal
showed 58.6 to 62.1% correct classification of pathological subjects
with 5.8 to 9.6% misclassification of normal subjects. In the
evaluation of the subjects who were undergoing vocal treatment, the
quantitative measures for the above two methods showed fairly
good agreement with the perceptual judgements.

A speaker  identification  scheme  using  the  speaker-based  VQ 
codebook  of  the  sustained  vowel  was  proposed  and  tested.  With  the 
pitch synchronous LPC vectors of the sustained vowel as feature
vectors, the codebook size of 4 was found to be suitable to
represent each speaker's feature space. For 40 normal speakers (20
males, 20 females), we achieved a correct identification rate of
99.4% from the training data set and 89.4% from the test data set
with speech samples of 50 pitch periods. It was also shown that the
length of the test vector had little effect on the identification
performance. Therefore, the proposed speaker identification scheme
may be applicable to vowel samples extracted from a running
speech signal.

To halve the search time and computational burden of searching all
the speakers' codebooks, we identified the input speaker's
gender prior to determining the speaker's identity. The minimum
distance  classifier  for  the  template  of  the  LPC  spectra  for  each 
gender,  obtained  from  the  training  data  set,  was  employed  to 
determine  the  input  speaker's  gender. 
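
A minimal sketch of such a gender pre-selection step is given below. It assumes that each gender template is the mean of that gender's training feature vectors and that Euclidean distance is used; neither detail is specified above, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def distance(u, v):
    """Euclidean distance (an assumed choice of distance)."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def build_templates(training):
    """training: dict mapping a gender label to its training feature vectors."""
    return {label: mean_vector(vecs) for label, vecs in training.items()}


def classify_gender(vector, templates):
    """Minimum distance classifier: pick the gender with the nearest template."""
    return min(templates, key=lambda label: distance(vector, templates[label]))


def candidate_speakers(vector, templates, speakers_by_gender):
    """Restrict the later codebook search to speakers of the detected gender."""
    return speakers_by_gender[classify_gender(vector, templates)]
```

With 20 male and 20 female speakers, only 20 of the 40 codebooks would then need to be searched for each input.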

In  summary,  three  major  accomplishments  were  achieved  in  this 
study.  First,  we  showed  the  potential  of  the  LPC  spectral 
distortion  measure  using  the  VQ  technique  for  the  evaluation  of 
laryngeal  function.  Second,  we  developed  a new  technique  for  the 
perturbation analysis of the EGG signal. This technique was
demonstrated to have a significant ability to convey useful
information about anomalous vibratory patterns of the vocal folds
reflected in the EGG signal. The analysis results for each subject
can be displayed as in Figures 5-8 and 5-9 for visual analysis of the
data. With experience, the clinician might be able to recognize the
significance of various patterns. Finally, we developed a speaker
identification  scheme  using  the  VQ  technique  for  the  short  segments 
of  the  sustained  vowel. 

Some  suggestions  for  further  development  in  this  research  are 
as  follows.  The  features  we  extracted  from  the  speech  and  EGG 
signals  have  shown  the  potential  for  the  detection  of  a laryngeal 
pathology  and  for  the  quantitative  assessment  of  laryngeal  function 
during vocal treatment. Further analysis with a larger number of
both pathological and normal subjects is necessary to evaluate
their usefulness in a clinical environment.

In  the  EGG  signal  analysis,  the  measurement  error  caused  by  an 
insufficient  sampling  frequency  was  ignored  in  this  study.  We  used 
a 10  kHz  sampling  frequency,  but  the  effect  of  higher  sampling 
frequencies  on  the  EGG  waveform  perturbation  measures  needs  to  be 
studied  in  depth. 
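
The size of this effect can be gauged from the sampling interval alone. The sketch below is illustrative only; the fundamental frequencies used in the usage note are example values, not measurements from this study.

```python
def timing_resolution(fs_hz, f0_hz):
    """Return (sampling interval in ms, that interval as a percentage of one pitch period)."""
    interval_ms = 1000.0 / fs_hz
    percent_of_period = 100.0 * f0_hz / fs_hz
    return interval_ms, percent_of_period
```

For example, `timing_resolution(10000, 100)` gives a sampling interval of 0.1 ms, which is 1% of a 10 ms pitch period; at a fundamental frequency of 200 Hz the same interval is 2% of the period. Time interval measurements located to the nearest sample are quantized in steps of this size.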

The  EGG  signal  analysis  has  shown  different  patterns  of 
variations  among  the  pathological  subjects,  i.e.,  some  displayed 
large  variations  either  in  the  amplitude  difference  measurement  or 
in  the  time  interval  measurement,  while  others  showed  large 
variations of both. Therefore, further studies are needed, with a
large pathological group that broadly represents voice pathologies
and includes perceptual judgements, to assess the possible
use of these features to differentiate types and degrees of
laryngeal pathologies.

The proposed speaker identification system is very promising,
considering that each speaker was identified from a
population of 40 speakers with a single vowel phonation. It was
shown that the identification error rate varied greatly from speaker
to speaker. Further studies are needed to reduce the large
variability of the identification error rate across speakers. Using
more than one vowel or varying the codebook size depending upon each
speaker's average distortion can be considered.


APPENDIX  A 


PITCH  SYNCHRONOUS  LPC  SPECTRA  FOR  SYNTHETIC  SPEECH  SIGNALS 

We used a formant synthesizer [Klatt, 1980; Pinto, 1987]
excited by a glottal waveform generated by the LF-model [Fant et
al., 1985]. The LF-model has three parameters to control the glottal
waveshape, and is shown in Figure A1. Four excitation pulses having
different glottal waveshapes were used to synthesize the vowel /i/.
Parameter values for the formants and the LF source model are given
in Table A1. The glottal waveforms and corresponding synthetic
speech signals are plotted in Figure A2.

The pitch synchronously analyzed LPC spectra for the four synthetic
speech signals are shown in Figures A3 and A4. The speech signals
were preemphasized before analysis. In Figure A3, the frame size of
a single pitch period was chosen from the maximum closing point of
the glottal waveform. However, in Figure A4, the analysis frame was
placed somewhat arbitrarily around the closing point of the glottal
waveform. The LPC spectra for the frame size of five pitch periods
are given in Figure A5. Table A2 shows the formant frequencies
estimated from the LPC spectra shown in Figure A3. We can see that
using a Hamming window resulted in more accurate and reliable
spectral estimation: the F1 estimates lie within 9 Hz of the 310 Hz
synthesizer setting with the Hamming window, but are 19 to 59 Hz
too low with the rectangular window.
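
The kind of analysis described above can be sketched as follows: an autocorrelation-method LPC analysis of one frame with an optional Hamming window, followed by peak picking on the LPC spectrum. This is a minimal sketch and not the program used in this study; preemphasis is left out, and the function names, the frequency grid size, and the peak-picking rule are assumptions.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def lpc(frame, order, window="hamming"):
    """Autocorrelation-method LPC. Returns [1, a1, ..., ap] of A(z) = 1 + sum a_i z^-i.

    Any `window` value other than "hamming" leaves the frame unweighted (rectangular).
    """
    x = list(frame)
    if window == "hamming":
        x = [s * w for s, w in zip(x, hamming(len(x)))]
    r = [sum(x[i] * x[i + k] for i in range(len(x) - k)) for k in range(order + 1)]
    a, err = [1.0] + [0.0] * order, r[0]
    for m in range(1, order + 1):  # Levinson-Durbin recursion
        k = -(r[m] + sum(a[i] * r[m - i] for i in range(1, m))) / err
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        err *= 1.0 - k * k
    return a


def lpc_spectrum_db(a, fs, n_points=2048):
    """LPC magnitude spectrum 1/|A| in dB on a grid from 0 Hz up to (not including) fs/2."""
    freqs = [j * (fs / 2.0) / n_points for j in range(n_points)]
    mags = []
    for f in freqs:
        w = 2 * math.pi * f / fs
        big_a = sum(c * cmath.exp(-1j * w * i) for i, c in enumerate(a))
        mags.append(-20.0 * math.log10(abs(big_a)))
    return freqs, mags


def spectral_peaks(freqs, mags):
    """Frequencies of the local maxima of the LPC spectrum (formant candidates)."""
    return [freqs[j] for j in range(1, len(mags) - 1)
            if mags[j - 1] < mags[j] >= mags[j + 1]]
```

For a frame sampled at 10 kHz, `spectral_peaks(*lpc_spectrum_db(lpc(frame, 10), 10000.0))` would return the candidate formant frequencies of a 10th order analysis; passing `window="rectangular"` to `lpc` gives the unwindowed case for comparison.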




tp:  Glottal flow peak
te:  Maximum closing discontinuity
ta:  Maximum closure

Figure A1.  LF-model generated glottal waveform and its derivative

Figure A2.  Glottal waveforms and corresponding synthetic speech
signals (Vowel /i/)
[Panels: Type 1, Type 2]

Figure A3.  LPC spectra for the synthetic speech signals
(10th order pitch synchronous analysis for the frame
size of a single pitch period, corresponding speech
signals are shown in Figure A2)
(a)  Rectangular window
(b)  Hamming window

Figure A4.  LPC spectra for the synthetic speech signals
(10th order pitch synchronous analysis for the frame
size of a single pitch period, corresponding speech
signals are shown in Figure A2)
(a)  Rectangular window
(b)  Hamming window

Figure A5.  LPC spectra for the synthetic speech signals
(10th order pitch synchronous analysis for the frame
size of five pitch periods, corresponding speech
signals are shown in Figure A2)
(a)  Rectangular window
(b)  Hamming window

Table A1.  Parameters for synthesizing vowel /i/

LF model source parameters
(Fundamental frequency f0 = 100 Hz)

Note:  tp, te, and ta are expressed as a percentage of
a pitch period

Parameters of formant synthesizer

Formant    Frequency    Bandwidth
F1          310 Hz       70 Hz
F2         2100 Hz       70 Hz
F3         2600 Hz      150 Hz
F4         3500 Hz      120 Hz

Table A2.  Formant frequencies estimated from the LPC spectra
(LPC analysis frame size: one pitch period)

(a)  Hamming window

Glottal waveform    F1        F2         F3         F4
Type 1              310 Hz    2092 Hz    2581 Hz    3507 Hz
Type 2              311 Hz    2092 Hz    2598 Hz    3552 Hz
Type 3              307 Hz    2109 Hz    2802 Hz    -
Type 4              301 Hz    2108 Hz    2799 Hz    -

(b)  Rectangular window

Glottal waveform    F1        F2         F3         F4
Type 1              291 Hz    2102 Hz    2590 Hz    3493 Hz
Type 2              291 Hz    2092 Hz    2630 Hz    3555 Hz
Type 3              280 Hz    2123 Hz    2730 Hz    -
Type 4              251 Hz    2152 Hz    -          -


APPENDIX  B 


ALGORITHM  FOR  VARIOUS  MEASUREMENTS  OF  EGG  SIGNAL 


(Refer to Figure B1)

1.  By visual inspection of the given EGG signal, find an
approximate pitch period, APP, as reference and set i=0.

2.  Set i=i+1.

3.  With the aid of APP, find the position TIHP_i where the EGG
has a maximum value within one cycle.

4.  If i < 2, then go to step 2.

5.  Set k=i-1.

6.  Between the two points TIHP_k and TIHP_{k+1}, find the position
TIMD_k where the differentiated EGG has a negative peak value.

7.  If k < 2, then go to step 2.

8.  Between the two points TIMD_{k-1} and TIHP_k, find the position
TILP_{k-1} where the EGG has a negative peak value.

9.  From the peak-to-peak EGG amplitude, compute the lower
threshold level (LTH) and upper threshold level (HTH)
defined by 10% and 90% of the peak-to-peak EGG amplitude,
respectively.

10.  Between the two points TIMD_{k-1} and TILP_{k-1}, find the
position TILTHF_{k-1} where the EGG has the nearest value to
the lower threshold level (LTH).

11.  Between the two points TILP_{k-1} and TIHP_k, find the positions
TILTHR_{k-1} and TIHTHR_{k-1} where the EGG has the nearest value
to LTH and HTH, respectively.

12.  Between the points TIHP_k and TIMD_k, find the position
TIHTHF_{k-1} where the EGG has the nearest value to HTH.

13.  Go to step 2 for the measurements of the next cycle.

Figure B1.  Measurements of EGG signal at various points

(a)  X(n):  EGG signal

(b)  DX(n):  Differentiated EGG signal

* When  the  EGG  waveform  was  too  irregular  to  measure  these  values, 
we  ignored  that  cycle  and  continued  to  the  next  cycle. 
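
The measurement steps of this appendix can be sketched in code as follows. This is a minimal sketch under several assumptions that are not specified above: the approximate pitch period APP is given in samples, each maximum is searched in a window of one APP centered on the expected position, the differentiated EGG is computed by a central difference, and the peak-to-peak amplitude is taken per cycle. Irregular cycles are not detected or skipped, and the function names are hypothetical.

```python
def argmax_range(x, lo, hi):
    """Index of the largest sample in x[lo:hi]."""
    return max(range(lo, hi), key=lambda n: x[n])


def argmin_range(x, lo, hi):
    """Index of the smallest sample in x[lo:hi]."""
    return min(range(lo, hi), key=lambda n: x[n])


def nearest_level(x, lo, hi, level):
    """Index in [lo, hi] whose sample is closest to `level`."""
    return min(range(lo, hi + 1), key=lambda n: abs(x[n] - level))


def egg_landmarks(x, app):
    """Locate the per-cycle measurement points of an EGG signal x (list of samples)."""
    # Differentiated EGG by central difference (assumed differentiator)
    dx = [0.0] + [(x[n + 1] - x[n - 1]) / 2.0 for n in range(1, len(x) - 1)] + [0.0]
    # Steps 1-4: one maximum (TIHP) per cycle, guided by APP
    tihp = [argmax_range(x, 0, app)]
    while tihp[-1] + app + app // 2 <= len(x):
        tihp.append(argmax_range(x, tihp[-1] + app // 2, tihp[-1] + app + app // 2))
    # Step 6: negative peak of the differentiated EGG (TIMD) between successive maxima
    timd = [argmin_range(dx, tihp[k], tihp[k + 1]) for k in range(len(tihp) - 1)]
    cycles = []
    for j in range(len(timd) - 1):
        peak = tihp[j + 1]
        tilp = argmin_range(x, timd[j], peak + 1)            # step 8
        pp = x[peak] - x[tilp]                               # step 9 (per-cycle amplitude)
        lth, hth = x[tilp] + 0.1 * pp, x[tilp] + 0.9 * pp
        cycles.append({
            "TIMD": timd[j], "TILP": tilp, "TIHP": peak,
            "TILTHF": nearest_level(x, timd[j], tilp, lth),      # step 10
            "TILTHR": nearest_level(x, tilp, peak, lth),         # step 11
            "TIHTHR": nearest_level(x, tilp, peak, hth),         # step 11
            "TIHTHF": nearest_level(x, peak, timd[j + 1], hth),  # step 12
        })
    return cycles
```

Each returned dictionary holds the sample positions for one cycle, from which time interval and amplitude difference measurements could then be formed.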